ABSTRACT



The invention provides a method and materials for analyzing 
the frequency of sequences in a population of 
polynucleotides, such as a cDNA library. A population of 
restriction fragments is formed which is inserted into vectors 
which allow segments to be removed from each end of the 
inserted fragments. The segments from each restriction 
fragment are ligated together to form a pair of segments 
which serves as a tag for the restriction fragment, and the 
polynucleotide from which the fragment is derived. Pairs of 
segments are excised from the vectors and ligated to form 
concatemers which are cloned and sequenced. A tabulation 
of the sequences of pairs provides a frequency distribution 
of sequences in the population. 

8 Claims, 1 Drawing Sheet 
GENE EXPRESSION ANALYSIS

This is a continuation-in-part application of U.S. patent
application Ser. No. 09/028,128 filed Feb. 23, 1998, now
U.S. Pat. No. 6,054,276, which is incorporated by reference.

FIELD OF THE INVENTION 

The invention relates generally to methods and compositions
for quantitative analysis of gene expression, and more
particularly, to methods and compositions for accumulating
and analyzing sequence tags sampled from a population
of expressed genes.



BACKGROUND 
The desire to decode the human genome and to understand
the genetic basis of disease and a host of other
physiological states associated with differential gene
expression has been a key driving force in the development
of improved methods for analyzing and sequencing DNA,
Adams et al, Editors, Automated DNA Sequencing and Analysis
(Academic Press, New York, 1994). The human genome is
estimated to contain about 10^5 genes, about 15-30% of
which (or about 4-8 megabases) are active in any given
tissue. Such large numbers of expressed genes make it
difficult to track changes in expression patterns by available
techniques, such as with hybridization of gene products to
microarrays, direct sequence analysis, or the like. More
commonly, expression patterns are initially analyzed by
lower resolution techniques, such as differential display,
indexing, subtraction hybridization, or one of the numerous
DNA fingerprinting techniques, e.g. Vos et al, Nucleic Acids
Research, 23: 4407-4414 (1995); Hubank et al, Nucleic
Acids Research, 22: 5640-5648 (1994); Liang et al,
Science, 257: 967-971 (1992); Erlander et al, International
patent application PCT/US94/13041; McClelland et al, U.S.
Pat. No. 5,437,975; Unrau et al, Gene, 145: 163-169 (1994);
Geng et al, BioTechniques, 25: 434-438 (1998); and
the like. Higher resolution analysis is then frequently carried
out on subsets of cDNA clones identified by the application
of such techniques, e.g. Linskens et al, Nucleic Acids
Research, 23: 3244-3251 (1995).

Recently, two techniques have been implemented that
attempt to provide direct sequence information for analyzing
patterns of gene expression. One involves the use of
microarrays of oligonucleotides or polynucleotides for
capturing complementary polynucleotides from expressed
genes, e.g. Schena et al, Science, 270: 467-469 (1995);
DeRisi et al, Science, 278: 680-686 (1997); Chee et al,
Science, 274: 610-614 (1996); and the other involves the
excision and concatenation of short sequence tags from
cDNAs, followed by conventional sequencing of the
concatenated tags, i.e. serial analysis of gene expression
(SAGE), e.g. Velculescu et al, Science, 270: 484-486
(1995); Zhang et al, Science, 276: 1268-1272 (1997);
Velculescu et al, Cell, 88: 243-251 (1997). Both techniques
have shown promise as potentially robust systems for
analyzing gene expression; however, there are still technical
issues that need to be addressed for both approaches. For
example, in microarray systems, genes to be monitored must
be known and isolated beforehand, and with respect to
current generation microarrays, the systems lack the
complexity to provide a comprehensive analysis of mammalian
gene expression, they are not readily re-usable, and they
require expensive specialized data collection and analysis
systems, although these of course may be used repeatedly. In
sequence tag systems, although no special instrumentation is
necessary and an extensive installed base of DNA sequencers
may be used, the selection of type IIs tag-generating
enzymes is limited, and the length (nine nucleotides) of the
sequence tag in current protocols severely limits the number
of cDNAs that can be uniquely labeled. It can be shown that
for organisms expressing large sets of genes, such as
mammalian cells, the likelihood of nine-nucleotide tags being
distinct for all expressed genes is extremely low, e.g. Feller,
An Introduction to Probability Theory and Its Applications,
Second Edition, Vol. I (John Wiley & Sons, New York,
1971).
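The low likelihood of distinct nine-nucleotide tags can be
illustrated with a short calculation. The sketch below assumes,
purely for illustration, that each expressed gene receives a tag
drawn uniformly at random from the 4^9 possible nine-nucleotide
sequences. The function name and the figure of 15,000 expressed
genes (15% of the 10^5 genes estimated above) are assumptions
made for this example and are not taken from the cited protocols.

```python
import math


def prob_all_tags_distinct(n_genes: int, tag_len: int) -> float:
    """Probability that n_genes random tags of tag_len nucleotides all differ.

    Assumes tags are independent and uniform over the 4**tag_len sequences.
    """
    n_tags = 4 ** tag_len
    if n_genes > n_tags:
        return 0.0  # more genes than possible tags: a collision is certain
    # Sum logarithms so the product of (1 - i/n_tags) does not underflow early.
    log_p = sum(math.log1p(-i / n_tags) for i in range(n_genes))
    return math.exp(log_p)


if __name__ == "__main__":
    n_expressed = 15_000  # illustrative assumption
    for tag_len in (9, 14, 20):
        p = prob_all_tags_distinct(n_expressed, tag_len)
        print(f"tag length {tag_len}: P(all distinct) = {p:.3g}")
```

Under these assumptions the probability for nine-nucleotide tags
is vanishingly small, and it rises toward one only as the tag
length increases.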

It is clear from the above that there is a need for a 
technique to analyze gene expression that allows both the 
analysis of unknown genes and the unequivocal assignment 
of a sequence tag to an expressed gene. The availability of 
such techniques would find immediate application in medi- 
cal and scientific research, drug discovery, and genetic 
analysis in a host of applied fields, such as pest management 
and crop and livestock development. 

SUMMARY OF THE INVENTION 

In view of the above, objects of the present invention 
include, but are not limited to, providing a method for 
analyzing gene expression by tabulating sequence tags from 
expressed genes; providing a method of analyzing the 
expression of genes for which no previous sequence infor- 
mation exists; providing a method of recovering full length 
sequences of genes that display expression patterns of 
interest; providing a method of acquiring sequence tags of 
sufficient length for unequivocal identification of expressed 
genes; providing a method of measuring sequence frequen- 
cies in a population of polynucleotides; providing a method 
of genetic identification by tabulations of genomic sequence 
tags; and providing compositions and kits for implementing 
the method of the invention. 

The invention achieves these and other objects by providing
methods and materials for acquiring sequence tags
from a population of polynucleotides, such as a cDNA or
genomic library, or a sample thereof. In accordance with the
invention, the nucleotide sequence of a portion of each end
of each polynucleotide of the population is determined so
that a pair of nucleotide sequences, or sequence tags, is
obtained for each polynucleotide. Preferably, the method of
the invention comprises the steps of i) providing a population
of polynucleotides having predetermined ends; ii)
inserting each polynucleotide of the population into a vector,
so that the vector has at least one type IIs restriction
endonuclease recognition site adjacent to each end of the
inserted polynucleotide, each type IIs restriction
endonuclease recognition site being oriented such that a type IIs
restriction endonuclease recognizing either site cleaves the
vector interior to the inserted polynucleotide; iii) cleaving
each vector with one or more type IIs restriction
endonucleases recognizing the type IIs restriction endonuclease
recognition sites so that the vector is linearized and has a
sequence tag of the inserted polynucleotide at each end; iv)
re-circularizing the vector to form a pair of sequence tags for
the inserted polynucleotide; and v) determining the
nucleotide sequence of each pair of sequence tags of a sample of
re-circularized vectors. Preferably, the population of
polynucleotides having predetermined ends is produced by
digesting a cDNA library with one or more frequent-cutting
restriction endonucleases, e.g. restriction endonucleases
each having a four-base recognition sequence. Preferably,
the pairs of sequence tags are tabulated to form a frequency
distribution of sequences in the population of
polynucleotides which may be used directly, or related to the
frequency distribution of sequences in another population, such
as a cDNA library, from which the analyzed population is
derived. In one aspect of the invention, the pairs of sequence
tags are excised from the re-circularized vectors and ligated
together to form concatemers, which are cloned in a
conventional sequencing vector.
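The tabulation described above can be sketched in silico. The
following is a deliberately simplified model and not the wet-lab
procedure: cleavage is modeled as a split at the start of each
recognition site, overhangs, vector sequence and strand
orientation are ignored, and the parameter tag_len is a
hypothetical stand-in for the number of nucleotides captured by
the type IIs enzyme at each end. All function names and the toy
sequences are assumptions introduced for illustration.

```python
import re
from collections import Counter


def internal_fragments(cdna: str, site: str) -> list[str]:
    """Fragments bounded on both sides by a cleavage at `site`.

    n occurrences of the site yield n - 1 such fragments.
    """
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", cdna)]
    return [cdna[a:b] for a, b in zip(cuts, cuts[1:])]


def tag_pair(fragment: str, tag_len: int) -> tuple[str, str]:
    """One segment from each end of the fragment, taken together as a pair."""
    return fragment[:tag_len], fragment[-tag_len:]


def tabulate_tag_pairs(cdnas, site: str = "GATC", tag_len: int = 10) -> Counter:
    """Frequency distribution of tag pairs over a population of cDNAs."""
    counts: Counter = Counter()
    for cdna in cdnas:
        for fragment in internal_fragments(cdna, site):
            counts[tag_pair(fragment, tag_len)] += 1
    return counts


if __name__ == "__main__":
    # Toy library: three copies of one cDNA and one copy of another.
    abundant = "AAAA" + "GATC" + "C" * 10 + "GATC" + "TTTT"
    rare = "GG" + "GATC" + "A" * 6 + "GATC" + "GG"
    for pair, n in tabulate_tag_pairs([abundant] * 3 + [rare], tag_len=6).items():
        print(pair, n)
```

In this toy library the pair derived from the abundant cDNA is
counted three times and the pair from the rare cDNA once, which
is the kind of frequency distribution the method tabulates.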

The invention includes compositions and kits for implementing
the method of the invention. Preferably, compositions
of the invention include vectors for cleaving sequence
tags from each end of an inserted polynucleotide, such as
that illustrated in FIG. 1. Preferably, kits of the invention
include a vector, together with appropriate buffers,
restriction endonucleases, and the like, for carrying out the method
of the invention.

The present invention provides a means for analyzing
gene expression by tabulating pairs of sequence tags from
gene expression products, such as cDNAs. The invention
provides several advantages over prior art methods of gene
expression analysis, including the analysis of unknown
genes, longer sequence tags for unequivocal gene
identification, more flexibility in the selection of type IIs
restriction endonucleases for tag generation, means of
retrieving sequences of interest, no requirement for
specialized instrumentation, compatibility with the existing and
projected installed bases of DNA sequencers, and the like.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 contains a diagram of a vector for forming pairs of 
nucleotide sequences in accordance with a preferred 
embodiment of the invention. 

DEFINITIONS 
The term "oligonucleotide" as used herein includes linear
oligomers of natural or modified monomers or linkages,
including deoxyribonucleosides, ribonucleosides, and the
like, capable of specifically binding to a target
polynucleotide by way of a regular pattern of monomer-to-monomer
interactions, such as Watson-Crick type of base pairing, base
stacking, Hoogsteen or reverse Hoogsteen types of base
pairing, or the like. Usually monomers are linked by
phosphodiester bonds or analogs thereof to form oligonucleotides
ranging in size from a few monomeric units, e.g. 3-4, to
several tens of monomeric units, e.g. 40-60. Whenever an
oligonucleotide is represented by a sequence of letters
(upper case or lower case), such as "ATGCCTG," it will be
understood that the nucleotides are in 5'→3' order from left
to right and that "A" denotes deoxyadenosine, "C" denotes
deoxycytidine, "G" denotes deoxyguanosine, and "T"
denotes thymidine, unless otherwise noted. Usually
oligonucleotides comprise the four natural nucleotides; however,
they may also comprise non-natural nucleotide analogs. It is
clear to those skilled in the art when oligonucleotides having
natural or non-natural nucleotides may be employed, e.g.
where processing by enzymes is called for, usually
oligonucleotides consisting of natural nucleotides are required.

"Perfectly matched" in reference to a duplex means that
the poly- or oligonucleotide strands making up the duplex
form a double stranded structure with one another such that
every nucleotide in each strand undergoes Watson-Crick
basepairing with a nucleotide in the other strand. The term
also comprehends the pairing of nucleoside analogs, such as
deoxyinosine, nucleosides with 2-aminopurine bases, and
the like, that may be employed. In reference to a triplex, the
term means that the triplex consists of a perfectly matched
duplex and a third strand in which every nucleotide
undergoes Hoogsteen or reverse Hoogsteen association with a
basepair of the perfectly matched duplex.

As used herein, "nucleoside" includes the natural
nucleosides, including 2'-deoxy and 2'-hydroxyl forms, e.g.
as described in Kornberg and Baker, DNA Replication, 2nd
Ed. (Freeman, San Francisco, 1992). "Analogs" in reference
to nucleosides includes synthetic nucleosides having
modified base moieties and/or modified sugar moieties, e.g.
described by Scheit, Nucleotide Analogs (John Wiley, New
York, 1980); Uhlman and Peyman, Chemical Reviews, 90:
543-584 (1990), or the like, with the only proviso that they
are capable of specific hybridization. Such analogs include
synthetic nucleosides designed to enhance binding
properties, reduce complexity, increase specificity, and the
like.

As used herein, the term "complexity" in reference to a 
population of polynucleotides means the number of different 
species of polynucleotide present in the population. 

As used herein, "amplicon" means the product of an 
amplification reaction. That is, it is a population of 
polynucleotides, usually double stranded, that are replicated 
from one or more starting sequences. The one or more 
starting sequences may be one or more copies of the same 
sequence, or it may be a mixture of different sequences. 
Preferably, amplicons are produced either in a polymerase 
chain reaction (PCR) or by replication in a cloning vector. 

DETAILED DESCRIPTION OF THE 
INVENTION 

Methods and materials are provided for analyzing gene 
expression by tabulating sequence information from 
expressed genes. Polynucleotide products of expressed 
genes are preferably digested with one or more restriction 
endonucleases to produce a population of fragments with 
predetermined ends. Preferably, such polynucleotide 
products, which are usually cDNAs, are digested with one or 
more "frequent cutting" restriction endonucleases, so that 
fragments are formed having average lengths in the range of 
from a few tens of basepairs, e.g. 40-50, to a few hundreds 
of basepairs, e.g. 200-500, thereby assuring with high 
probability, e.g. >95%, and more preferably >98%, that 
every polynucleotide product will be cleaved at least once. 
Most preferably, frequent cutting restriction endonucleases 
consist of one or more restriction endonucleases having 
four-base recognition sites. Exemplary frequent cutting 
restriction endonucleases for use with the invention include 
Tsp 509 I, Nla III, Mbo I, Sau 3A I, Dpn II, Aci I, Hpa II, 
Msp I, Bfa I, HinP1 I, Hha I, Mse I, Taq I, and the like.
Preferably, frequent cutting restriction endonucleases are 
used which produce four-base overhangs, or protruding 
strands, such as Tsp 509 I, Nla III, Sau 3A, or the like. 

Depending on the embodiment, a randomly selected
cDNA may be represented by zero, one, or multiple pairs of
sequence tags. If no linkers that contain restriction sites
are added during cDNA library construction (described more
fully below), no pairs of sequence tags will be obtained if the
cDNA is cleaved only once or not at all by the one or more
restriction enzymes used; a single pair of sequence tags will
be obtained if two cleavage sites are present; and n-1 pairs
of sequence tags will be obtained if n cleavage sites are
present. In the preferred embodiment where linkers are
added, these numbers become one, two, or multiple pairs of
sequence tags, respectively. Consequently, a frequency
distribution of pairs of sequence tags taken from a cDNA
library will usually not reflect the actual frequencies of the
mRNAs from which the library was derived. However, the
observed frequencies of pairs of sequence tags will be
simple integral multiples of the actual frequencies; thus,
changes in the relative frequencies of expressed sequences
between two or more populations, e.g. cDNA libraries taken
under different conditions, are readily observable. Moreover,
multiple pairs of sequence tags per expressed gene also
provide an internal control for tracking changes in
frequencies, particularly for genes whose sequences are
already known. If the frequency of an expressed gene
doubles, then the frequency of each pair of its sequence tags
should also double. The following table provides guidance
regarding the changes in observed expression frequencies to
be expected with application of the method of the invention:

          Expected Number of Fragments (without Linkers)

                 Probability of at    Expected number   Probability of at    Expected number
                 least 2 restriction  of restriction    least 2 restriction  of restriction
Length of cDNA   sites of one         fragments per     sites of two         fragments per
(basepairs)      4-cutter             cDNA              4-cutters            cDNA

 500             .58                  1.95              .90                   3.9
1000             .90                  3.9               .996                  7.8
1500             .98                  5.9               .999                 11.7
2000             .996                 7.8               .999                 15.6



Thus, if a gene expression profile consisted of the expression
of four cDNAs 500, 1000, 1500, and 2000 basepairs in length
in a proportion of 1:1:1:1, the observed profile under the
method of the invention would be about 1:2:3:4, assuming
that an adequate sample of pairs of sequence tags is taken,
that the sequences of the expressed genes are known, and
that fragments are generated by cleavage with a single
four-base cutter. The latter ratio results because in a random
sample of pairs of sequence tags, one would be four times
more likely to select a pair from the 2000 basepair cDNA as
from the 500 basepair cDNA, three times more likely to
select a pair from the 1500 basepair cDNA as from the 500
basepair cDNA, and so on. If under different conditions the
expression of the 1000 basepair cDNA doubled resulting in
an expression profile of 1:2:1:1, then the profile observed by
application of the invention would be 1:4:3:4. If, for
example, the sequence of the 500 basepair cDNA were
unknown, so that there was no way to know that the
fragments generated in the method of the invention were
from the same gene, then the observed expression profile
would be more complex. If two fragments were generated
from the 500 basepair cDNA, then an expression profile
would consist of a ratio of five numbers: 1:1:4:6:8.
Likewise, if the 2000 basepair cDNA was from an unknown
gene and eight fragments were generated by the method,
then the observed expression profile would correspond to
the ratio 1:1:1:1:1:1:1:1:2:4:6.

Pairs of sequence tags may be obtained from cDNAs
without cleavage by a restriction endonuclease; however,
one of the sequence tags of each pair in such embodiments
typically consists of a segment of the polyA tail of the cDNA
and therefore lacks information content. The number of such
pairs of sequence tags provides an estimate of the total
number of expressed sequences obtained in a sample.
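The 1:2:3:4 and 1:4:3:4 profiles discussed above follow from
weighting each cDNA's abundance by its expected number of
fragments. The sketch below assumes, for illustration only,
that the expected number of fragments is proportional to cDNA
length divided by the mean spacing of a four-base site (4^4 =
256 basepairs); the function name and this proportionality are
assumptions of the example.

```python
def observed_profile(lengths_bp, abundances, site_len=4):
    """Relative tag-pair frequencies, normalized so the smallest entry is 1.

    Assumes the expected fragment count of a cDNA is proportional to its
    length divided by the mean spacing of a site_len-base recognition site.
    """
    mean_spacing = 4 ** site_len  # 256 bp for a four-base cutter
    raw = [a * length / mean_spacing for length, a in zip(lengths_bp, abundances)]
    smallest = min(raw)
    return [round(x / smallest, 2) for x in raw]


if __name__ == "__main__":
    lengths = [500, 1000, 1500, 2000]
    print(observed_profile(lengths, [1, 1, 1, 1]))  # equal expression
    print(observed_profile(lengths, [1, 2, 1, 1]))  # 1000 bp cDNA doubled
```

Because each cDNA's contribution scales with its own abundance,
a doubling of expression doubles the corresponding entry of the
observed profile while leaving the others unchanged.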

Preferably, the efficiency of detecting expressed genes is
increased by employing linkers ligated to the ends of the
cDNAs after second strand synthesis. Conventional
protocols may be followed, e.g. Section III, Ausubel et al, editors,
Current Protocols in Molecular Biology (John Wiley &
Sons, New York, 1997); however, the usual methylation step
of such conventional protocols is omitted. Preferably, the
restriction site contained in the linkers is recognized by at least
one of the restriction endonucleases used to generate the
polynucleotides with predetermined ends. Thus, every
cDNA will always give rise to at least one fragment. With
linkers, the expected number of fragments per cDNA
increases as follows:



          Expected Number of Fragments (with Linkers)

                 Probability of at    Expected number   Probability of at    Expected number
                 least 1 internal     of restriction    least 1 internal     of restriction
Length of cDNA   restriction site of  fragments per     restriction site of  fragments per
(basepairs)      single 4-cutter*     cDNA              two 4-cutters*       cDNA

 500             .857                 2.95              .90                   4.9
1000             .980                 4.9               .996                  8.8
1500             .997                 5.9               .999                 12.7
2000             .999                 8.8               .999                 16.6

* i.e. equals the probability of there being at least two fragments
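Most of the tabulated values appear consistent with a simple
model in which recognition sites of a four-base cutter occur at
random with a mean spacing of 4^4 = 256 basepairs, so that the
number of sites in a cDNA is approximately Poisson distributed.
That model is an assumption of this sketch and is not stated in
the text; the function names are likewise illustrative, and the
model need not reproduce every table entry exactly.

```python
import math


def expected_sites(length_bp: float, n_cutters: int = 1, site_len: int = 4) -> float:
    """Expected number of recognition sites in a random sequence."""
    return n_cutters * length_bp / 4 ** site_len


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for a Poisson variable X with the given mean."""
    below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    for length in (500, 1000, 1500, 2000):
        lam = expected_sites(length)
        print(
            length,
            round(prob_at_least(2, lam), 3),  # without linkers: at least 2 sites
            round(lam, 2),                    # without linkers: expected sites
            round(prob_at_least(1, lam), 3),  # with linkers: at least 1 internal site
            round(lam + 1, 2),                # with linkers: expected fragments
        )
```

With linkers, each internal site adds one fragment to the single
fragment that every cDNA yields, which is why the expected
fragment count in this model is the expected number of sites
plus one.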



Preferably, the method of the invention is carried out
using a vector, such as that illustrated in FIG. 1. The vector
is readily constructed from commercially available materials
using conventional recombinant DNA techniques, e.g. as
disclosed in Sambrook et al, Molecular Cloning, Second
Edition (Cold Spring Harbor Laboratory, New York, 1989).
Preferably, pUC-based plasmids, such as pUC 19, or λ-based
phages, such as λZAP Express (Stratagene Cloning
Systems, La Jolla, Calif.), pZErO (Invitrogen Corp.,
Carlsbad, Calif.), or like vectors are employed. Important
features of the vector are recognition sites (104) and (112)
for two type IIs restriction endonucleases that flank
restriction fragment (108). For convenience, the two type IIs
restriction enzymes are referred to herein as "IIs1" and
"IIs2", respectively. IIs1 and IIs2 may be the same or
different. Recognition sites (104) and (112) are oriented so that
the cleavage sites of IIs1 and IIs2 are located in the interior
of restriction fragment (108). In other words, taking the 5'
direction as "upstream" and the 3' direction as
"downstream," the cleavage site of IIs1 is downstream of its
recognition site and the cleavage site of IIs2 is upstream of
its recognition site. Thus, when the vector is cleaved with
IIs1 and IIs2, two segments (118) and (120) of restriction
fragment (108) remain attached to the vector. The vector is
then re-circularized by ligating the two ends together,
thereby forming a pair of segments, or sequence tags. If such
cleavage results in one or more single stranded overhangs,
i.e. one or more non-blunt ends, then the ends are preferably
rendered blunt prior to re-circularization, for example, by
digesting the protruding strand with a nuclease such as
Mung bean nuclease, T4 DNA polymerase, or the like, or by
extending a 3' recessed strand, if one is produced in the
digestion, or by providing an adaptor mixture. The ligation
reaction for re-circularization is carried out under conditions
that favor the formation of covalent circles rather than
concatemers of the vector. Preferably, the vector
concentration for the ligation is between about 0.4 and about
4.0 μg/ml of vector DNA, e.g. as disclosed in Collins et al,
Proc. Natl. Acad. Sci., 81: 6812-6816 (1984), for λ-based
vectors. For vectors of different molecular weight, the
concentration range is adjusted appropriately, e.g. Dugaiczyk
et al, J. Mol. Biol., 96: 171-184 (1975).

In the preferred embodiments, the number of nucleotides
identified depends on the "reach" of the type IIs restriction
endonucleases employed. "Reach" is the amount of
separation between a recognition site of a type IIs restriction
endonuclease and its cleavage site, e.g. Brenner, U.S. Pat.
No. 5,559,675. The conventional measure of reach is given
as a ratio of integers, such as "(16/14)", where the numerator
is the number of nucleotides from the recognition site in the
5'→3' direction that cleavage of one strand occurs and the
denominator is the number of nucleotides from the
recognition site in the 3'→5' direction that cleavage of the other
strand occurs. Preferred type IIs restriction endonucleases
for use as IIs1 and IIs2 in the preferred embodiment include
the following: Bbv I, Bce 83 I, Bce f I, Bpm I, Bsg I, BspLU
11 III, Bst 71 I, Eco 57 I, Fok I, Gsu I, Hga I, Mme I, and
the like. In the preferred embodiment, a vector is selected
which does not contain a recognition site, other than (104)
and (112), for the type IIs enzyme(s) used to generate pairs
of segments; otherwise, re-circularization cannot be carried
out. Preferably, a type IIs restriction endonuclease for
generating pairs of segments has as great a reach as possible to
maximize the probability that the nucleotide sequences of
the segments are unique.

Immediately adjacent to IIs sites (104) and (112) are
restriction sites (106) and (110), respectively, that permit
restriction fragment (108) to be inserted into the vector. That
is, restriction site (106) is immediately downstream of (104)
and (110) is immediately upstream of (112). Preferably, sites
(104) and (106) are as close together as possible, even
overlapping, provided type IIs site (106) is not destroyed
upon cleavage with the enzymes for inserting restriction
fragment (108). This is desirable because the recognition site
of the restriction endonuclease used for generating the
fragments occurs between the recognition site and cleavage
site of the type IIs enzyme used to remove a segment for
sequencing, i.e. it occurs within the "reach" of the type IIs
enzyme. Thus, the closer the recognition sites, the larger the
piece of unique sequence that can be removed from the
fragment. The same of course holds for restriction sites (110)
and (112). Preferably, whenever the vector employed is based
on a pUC plasmid, restriction sites (106) and (110) are selected
from restriction sites of the polylinker region of the pUC
plasmid that upon cleavage leave ends compatible with ends
left by the frequent cutting enzyme being employed. For
example, Tsp 509 fragments may be inserted into an Eco RI
site, Nla III fragments may be inserted into Sph I or Nsp I
sites, and Sau 3A fragments may be inserted into Bam HI,
Bcl I, Bgl II, or Bst YI sites.

Preferably, the vectors contain primer binding sites (100)
and (116) for primers p1 and p2, respectively, which may be
used to amplify the pair of segments by PCR after
re-circularization. Recognition sites (102) and (114) are for
restriction endonucleases w1 and w2, which are used to
cleave the pair of segments from the vector after
amplification. Preferably, w1 and w2, which may be the same or
different, are type IIs restriction endonucleases whose
cleavage sites correspond to those of (106) and (110), thereby
removing surplus, or non-informative, sequence (such as the
recognition sites (104) and (112)) and generating protruding
ends that permit concatenation of the pairs of segments.

As mentioned above, preferably polynucleotides for
analysis by the method of the invention are derived from
mRNA extracted from a cell or tissue source. mRNA may be
prepared by a commercially available mRNA extraction kit
using conventional protocols, e.g. PolyATract series 9600 kit
(Promega, Madison, Wis.); FastTrack 2.0 kit (Invitrogen,
Calif.); Dynabeads Oligo(dT)25 (Dynal, Oslo, Norway), or
the like. After extraction, mRNA is converted into cDNA
using conventional protocols with minor modifications, such
as omission of methylation steps to ensure that the cDNA
can be cleaved with selected restriction endonucleases.
Again, cDNA synthesis may be accomplished using
commercially available kits, e.g. StrataScript RT-PCR kit
(Stratagene Cloning Systems, La Jolla, Calif.); SMART
PCR cDNA Synthesis kit (Clontech Laboratories, Palo Alto,
Calif.); Riboclone cDNA Synthesis System (Promega Corp.,
Madison, Wis.); or the like. Preferably, a protocol is
employed which results in the conversion of mRNA into
blunt-ended double-stranded cDNA, after which linkers,
each containing a selected restriction site, are ligated to the
cDNA. The selected restriction site preferably corresponds
to that of, or includes a site of, one of the one or more
restriction endonucleases used to generate a population of
polynucleotides, e.g. cDNA fragments, with predetermined
ends. Alternatively, a biotinylated oligo-dT primer is
provided for first strand synthesis which results in the
production of cDNAs having a biotin group that permits
purification on a conventional avidinated solid phase support, e.g.
M-280 Dynabeads (Dynal, Oslo, Norway). Preferably,
linkers containing a recognition site of the selected four-base
cutter are ligated to the opposite ends of the cDNAs. After
affinity purification, the cDNAs may be digested with a
selected four-base cutting endonuclease and the released
fragments used for analysis in accordance with the
invention.

In some applications of the invention, it may be desirable
to employ a cDNA construction technique that maximizes
the production of full length cDNAs. In this way, cDNAs
that are randomly truncated near their 5' ends are minimized
and a source of noise in the gene expression measurements
is reduced or eliminated. Techniques for full length cDNA
production are disclosed in Carninci et al, DNA Research, 4:
61-66 (1997); and Cap Finder PCR cDNA Synthesis kit
product literature (Clontech Laboratories, Palo Alto, Calif.).
Alternatively, 3' biases in clone representation can be
reduced by using a random priming technique for first strand
synthesis of cDNAs, e.g. Koike et al, Nucleic Acids
Research, 15: 2499 (1987). Random-primer kits are
commercially available, e.g. Riboclone cDNA Synthesis System
(Promega Corp., Madison, Wis.); or the like.
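The "reach" notation defined above, and the effect of blunting
and of site spacing on the usable tag length, can be made
concrete with a small sketch. It assumes that the numerator
refers to the strand whose retained end is a 3' end and the
denominator to the strand whose retained end is a 5' end, so
that a larger numerator leaves a 3' protruding strand. The
function names, the spacer parameter (the number of nucleotides
within the reach that carry no unique insert sequence) and the
value "(9/13)" are hypothetical and introduced only for
illustration; "(16/14)" is the example given in the text.

```python
def parse_reach(reach: str) -> tuple[int, int]:
    """Split a reach such as "(16/14)" into (numerator, denominator)."""
    top, bottom = reach.strip("()").split("/")
    return int(top), int(bottom)


def retained_after_blunting(reach: str, fill_in: bool = False) -> int:
    """Nucleotides left beyond the recognition site once the end is blunt.

    A 3' protruding strand can only be digested back; a 3' recessed strand
    can either be extended (fill_in=True) or the 5' protrusion digested.
    """
    top, bottom = parse_reach(reach)
    if top > bottom:
        return bottom  # 3' overhang removed by nuclease digestion
    return bottom if fill_in else top


def informative_tag_length(reach: str, spacer: int, fill_in: bool = False) -> int:
    """Retained nucleotides minus those lying between the sites (spacer)."""
    return max(0, retained_after_blunting(reach, fill_in) - spacer)


if __name__ == "__main__":
    print(retained_after_blunting("(16/14)"))
    print(retained_after_blunting("(9/13)"), retained_after_blunting("(9/13)", fill_in=True))
    for spacer in (0, 4, 6):
        print(spacer, informative_tag_length("(16/14)", spacer))
```

The last loop illustrates why the sites are preferably placed as
close together as possible: every nucleotide of spacing within
the reach reduces the amount of unique sequence captured.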

After insertion of the fragments into a vector, a suitable
host is transformed with copies of the vector and cultured,
i.e. expanded, using conventional techniques. Transformed
host cells are then selected, e.g. by plating and picking
colonies using a standard marker, e.g. β-galactosidase/X-gal.
Alternatively, the fragments may be cloned into a vector
which forces selection against non-recombinants, e.g.
pZErO series of vectors available from Invitrogen Corp.
(Carlsbad, Calif.). A large enough sample of
recombinant-containing host cells is taken to ensure that at least one pair
from every fragment is present for analysis with a
reasonably large probability. The number of fragments, N, that
must be in a sample to achieve a given probability, P, of
including a given fragment is the following:
N = ln(1-P)/ln(1-f), where f is the frequency of the fragment in the
population. Thus, for a population of 10,000 different kinds 



recognized by their length and their spacing between known 
recognition sites, and in this embodiment, each pair of 
sequence tags requires that a sequence of 22 nucleotide be 
identified. Assuming that 20 pairs, or 440 bases, are 
5 sequenced in each sequencing reaction in a conventional 
sequencing protocol, about 2300 sequencing reactions must 
be carried out and the same number of electrophoretic 
separations must be made to analyzed 46,000 pairs of 
sequence tags. 

10 As mentioned above, multiple frequent cutting restriction 
endonucleases may be employed in which case multiple 
cloning vectors or adaptors must be used for capturing all 
fragment types. For example, is if two frequent cutters r and 
q are used, three fragment types are produced: those with 

15 both ends resulting from cleavage by r, or r-r fragments; 
those with both ends resulting from cleavage by q, or q-q 
fragments; and those with mixed ends, or r-q fragments. 
Linkers may also be employed in such multiple enzyme 
embodiments. A single cloning vector may be used if 



of cDNA, a sample containing 69,000 vectors will include at 20 adaptors are provided to convert the ends of the various 



least one copy of each fragment (even those present at a 
frequency of 1 in 10,000) with a probability of 99.9%; and 
a sample containing 46,000 vectors will include at least one 
copy of each fragment with a probability of 99%. For this 
calculation, it is assumed that each cDNA is cleaved into the
same number of fragments. By varying the number of pairs
sequenced, the sensitivity of the technique for detecting
changes in expression can also be varied. Preferably, a
sample size is employed that results in at least one copy of
every sequence present at a frequency of 0.1 percent in the
population being studied with a probability of 99%. More
preferably, a sample size is employed that results in at least
one copy of every sequence present at a frequency of 0.01
percent in the population being studied with a probability of
99%.
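
The sample-size relation N=ln(1-P)/ln(1-f) is easy to evaluate directly. The short Python sketch below (function name hypothetical) reproduces the figures quoted for a fragment present at a frequency of 1 in 10,000; it rounds up to the next whole vector, so the results come out slightly above the rounded values of about 46,000 and 69,000 given in the text.

```python
import math


def sample_size(p_detect: float, freq: float) -> int:
    """Smallest N such that a fragment of frequency `freq` appears at least
    once among N sampled vectors with probability `p_detect`.

    Implements N = ln(1 - P) / ln(1 - f), rounded up.
    """
    return math.ceil(math.log(1.0 - p_detect) / math.log1p(-freq))


if __name__ == "__main__":
    f = 1.0 / 10_000
    for p in (0.99, 0.999):
        print(f"P = {p}: N = {sample_size(p, f)}")
```

The same function applies to the preferred sample sizes: a frequency of 0.1 percent corresponds to freq=0.001, and 0.01 percent to freq=0.0001.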

After selection, the vector-containing hosts are combined 
and expanded in culture. The vectors are then isolated, e.g.
by a conventional mini-prep, or the like, and cleaved with
IIs1 and IIs2. The fragments comprising the vector and ends
(i.e. segments) of the restriction fragment insert are isolated,
e.g. by gel electrophoresis, blunted, and re-circularized. The
resulting pairs of segments in the re-circularized vectors are
then amplified, e.g. by polymerase chain reaction (PCR),
after which the amplified pairs are cleaved with w to free the
pairs of sequence tags, which are then isolated, e.g. by gel
electrophoresis, or like technique. Preferably, the isolated
pairs are concatenated in a conventional ligation reaction to 
produce concatemers of various sizes, which are separated, 
e.g. by gel electrophoresis. Concatemers greater than about 



fragment types to ends that allow insertion into the cloning 
vector. Preferably, in such embodiments, the adaptors 
include a recognition site for the type IIs restriction endo-
nuclease used to generate sequence tags. For example, if Tsp
509 and Sau 3A are used to generate fragments from a cDNA
library and if Bsg I is the type IIs restriction endonuclease
used to generate sequence tags, such adaptors can have the
following form (SEQ ID NO: 1, SEQ ID NO: 2, and SEQ ID
NO: 3) for insertion into an Eco RI site of a cloning vector:



Formula I

      Eco RI    Bsg I Compatible End
        ↓         ↓    ↓
ggctaggaattcattcgtgcag
ccgatccttaagtaagcacgtcttaa

ggctaggaattcattcgtgcag
ccgatccttaagtaagcacgtcctag

Thus, after a cDNA library is digested to completion with 
Tsp 509 and Sau 3A, the above adaptors are ligated to the 
ends of the fragments followed by digestion with Eco RI. 
The fragments are then treated as described above in the 
single frequent cutter embodiment. 
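
The three fragment types of a two-enzyme digestion can be illustrated with a small in-silico digest. The sketch below is hypothetical and makes simplifying assumptions: r is taken to be Tsp 509 I (site AATT) and q to be Sau 3A (site GATC), only the top strand is scanned, and only internal fragments (those with both ends produced by cleavage) are classified.

```python
import re
from collections import Counter

# Assumed recognition sites: r = Tsp 509 I, q = Sau 3A.
CUTTERS = {"r": "AATT", "q": "GATC"}


def cut_positions(seq):
    """Return sorted (position, enzyme) tuples for every site in seq."""
    hits = []
    for name, site in CUTTERS.items():
        # Lookahead so that overlapping occurrences are all reported.
        hits += [(m.start(), name) for m in re.finditer(f"(?={site})", seq)]
    return sorted(hits)


def fragment_types(seq):
    """Count r-r, q-q and r-q fragments produced by a complete double digest."""
    hits = cut_positions(seq.upper())
    types = Counter()
    for (_, left), (_, right) in zip(hits, hits[1:]):
        types["-".join(sorted((left, right), reverse=True))] += 1
    return types


if __name__ == "__main__":
    toy = ("CC" + "AATT" + "CCCC" + "GATC" + "CCCC" + "GATC"
           + "CC" + "AATT" + "CC" + "AATT" + "CC")
    print(dict(fragment_types(toy)))
```

With adaptors of the kind shown in Formula I, all three fragment types end up with Eco RI-compatible ends, which is why a single cloning vector suffices.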

The following examples serve to illustrate the present 
invention and are not meant to be limiting. Selection of 
many of the reagents, e.g. enzymes, vectors, and other 
materials; selection of reaction conditions and protocols; and
material specifications, and the like, are matters of design

200-300 basepairs are isolated and cloned into a standard
sequencing vector, such as pUC 19, pBluescript, M13, or the
like. The sequences of the cloned concatenated pairs are 
analyzed on a conventional DNA sequencer, such as a model 
377 DNA sequencer from Perkin-Elmer Applied Biosystems 
Division (Foster City, Calif.). 

In the above embodiment, the sequences of the pairs of 
segments are readily identified between sequences for the 
recognition site of the enzymes used in the digestions. For 
example, when pairs are concatenated from fragments pro- 
duced by digestion with frequent cutting enzyme r and 
cleavage with a type IIs restriction endonuclease of reach
(16/14), the following pattern is observed: 

^^^r^^N^ITrNNN^w^ 






choice which may be made by one of ordinary skill in the art. 
Extensive guidance is available in the literature for applying 
particular protocols for a wide variety of design choices 
made in accordance with the invention, e.g. Sambrook et al, 
Molecular Cloning, Second Edition (Cold Spring Harbor 
Laboratory, New York, 1989); Ausubel et al, editors, Current 
Protocols in Molecular Biology (John Wiley & Sons, New 
York, 1997); and the like. 

EXAMPLE 1 

Analysis of Yeast Gene Expression by Tsp 509 
Digestion of a cDNA Library having Eco RI 

Linkers 



where "r" represents the nucleotides of the recognition sites
of restriction endonuclease r, and where the N's are the
nucleotides of the pairs of sequence tags. Thus, the pairs are

In this example, a cDNA library is constructed from
mRNA extracted from Saccharomyces cerevisiae cells of
strain YPH499 (ATCC accession No. 76625). After ligation
of commercial Eco RI linkers, the cDNAs are digested to
completion with four-base cutter, Tsp 509 I, and are inserted
into a pUC 19 cloning vector modified as described below
for expansion and generation of pairs of sequence tags. The
pairs of sequence tags are excised from the vector, ligated to
form concatemers, cloned, and sequenced.

Synthetic oligonucleotides (i) through (iv) are combined 
with an Eco RI and Hind III digested pUC 19 in a conven- 
tional ligation reaction so that they assemble into the double 
stranded insert of Formula II:

10 μg mRNA from the yeast cells is reverse transcribed
with a commercially available kit (e.g., RiboClone cDNA
Synthesis System, Promega Corp., Wis.) which follows the
protocol described in Ausubel et al (cited above), pages
5.5.1-5.5.13 and 5.6.1-5.6.10. Briefly, 10 μg mRNA at a
concentration of 1 μg/μl is heated in a tightly sealed micro-
centrifuge tube for 5 min at 65° C., then placed immediately
on ice. In a separate tube, the following components are
added in the following order to give a total volume of about
180 μl: 20 μl 5 mM dNTPs (each at 500 μM final
concentration); 40 μl 5× RT buffer (for a final concentration



(i) 5'-aattagccgtacctgcagcagtgcagg (SEQ ID NO: 4)

(ii) 5'-p-aattcctgcacagctgcgaatcattcg (SEQ ID NO: 5)

(iii) 5'-agctcgaatgattcgcagctgt (SEQ ID NO: 6)

(iv) 5'-p-gcaggaattcctgcactgctgcaggtacggct (SEQ ID NO: 7)

where the 5' "p's" in formulas (ii) and (iv) represent 5' 
phosphate groups. 



Formula II
(SEQ ID NO: 8)

                 Bbv I Bsg I
                   ↓     ↓
5'-AATTAGCCGTACCTGCAGCAGTGCAG-
       TCGGCATGGACGTCGTCACGTC-

  -GAATTCCTGCACAGCTGCGAATCATTCG
  -CTTAAGGACGTGTCGACGCTTAGTAAGCTCGA
     ↑      ↑     ↑
   Eco RI Bsg I Bbv I

Note that the insert has compatible ends to the Eco RI-Hind
III-digested plasmid, but that the original Eco RI and Hind
III sites are destroyed upon ligation. The horizontal arrows 
above and below the Bsg I and Bbv I sites indicate the 
direction of the cleavage site relative to the recognition site 
of the enzymes. After ligation, transformation of a suitable 
host, and expansion, the modified pUC 19 is isolated and the 
insert is sequenced to confirm its identity. 
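
Before ordering the oligonucleotides, the design of the Formula II insert can be checked computationally. The following Python sketch is illustrative only (the helper names are hypothetical): it joins oligonucleotides (i)+(ii) and (iii)+(iv) as listed in the text, confirms that the two strands are complementary apart from their four-nucleotide 5' overhangs, and confirms that the Eco RI, Bsg I and Bbv I recognition sequences are present.

```python
# Oligonucleotides (i)-(iv) as listed in the text, written 5'->3'.
OLIGO_I = "aattagccgtacctgcagcagtgcagg"
OLIGO_II = "aattcctgcacagctgcgaatcattcg"
OLIGO_III = "agctcgaatgattcgcagctgt"
OLIGO_IV = "gcaggaattcctgcactgctgcaggtacggct"

_COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(seq: str) -> str:
    """Reverse complement of a lower-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def check_insert(overhang: int = 4) -> dict:
    """Summarize whether the four oligonucleotides form the expected duplex."""
    top = OLIGO_I + OLIGO_II        # corresponds to SEQ ID NO: 8
    bottom = OLIGO_III + OLIGO_IV   # opposite strand, 5'->3'
    return {
        "top_overhang": top[:overhang],          # pairs with Eco RI-cut vector
        "bottom_overhang": bottom[:overhang],    # pairs with Hind III-cut vector
        "duplex_ok": revcomp(top[overhang:]) == bottom[overhang:],
        "has_eco_ri": "gaattc" in top,
        "has_bsg_i_both_orientations": "gtgcag" in top and "ctgcac" in top,
        "has_bbv_i_both_orientations": "gcagc" in top and "gctgc" in top,
    }


if __name__ == "__main__":
    for key, value in check_insert().items():
        print(key, value)
```

A check of this kind does not replace sequencing the cloned insert, but it catches transcription errors in the oligonucleotide sequences early.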

Yeast cells are grown at 30° C. in YPD rich medium (YPD
supplemented with 6 mM uracil, 4.8 mM adenine, and 24
mM tryptophan) (Rose et al, Methods in Yeast Genetics
(Cold Spring Harbor Laboratory Press, 1990)). Cell density
is measured by counting cells from duplicate dilutions, and
the number of viable cells per milliliter is estimated by
plating dilutions of the cultures on YPD agar immediately
before collecting cells for mRNA extraction. Cells in mid-
log phase (1-5×10^7 cells/ml) are pelleted, washed twice with
AE buffer solution (50 mM NaAc, pH 5.2, 10 mM EDTA), 
frozen in a dry ice-ethanol bath, and stored at -80° C. 

Total RNA is extracted from frozen cell pellets using a hot
phenol method, described by Schmitt et al, Nucleic Acids
Research, 18: 3091-3092 (1990), with the addition of a
chloroform-isoamyl alcohol extraction just before precipi-
tation of the total RNA. Phase-Lock Gel (5 Prime-3 Prime,
Inc., Boulder, Colo.) is used for all organic extractions to
increase RNA recovery and decrease the potential for con-
tamination of the RNA with material from the organic
interface. Poly(A)+ RNA is purified from the total RNA with
an oligo-dT selection step (Oligotex, Qiagen, Chatsworth, 
Calif.). 




of 1×); 10 μl 200 mM dithiothreitol (10 mM final
concentration); 20 μl 0.5 mg/ml oligo(dT)12-18 (50 μg/ml
final concentration); 60 μl H2O; and 10 μl (10 units) RNasin
(50 units/ml final concentration). 5× RT buffer is 250 μl 1M
Tris-Cl (pH 8.2); 250 μl 1M KCl; 30 μl 1M MgCl2; and 470
μl H2O. The components are mixed by vortexing, briefly
microcentrifuged, and then added to the tube containing the
RNA, after which 20 μl AMV reverse transcriptase (200
units) is added for a final concentration of 1000 units/ml in
200 μl. After mixing by vortexing, 10 μl of the mixture is
removed to a separate tube containing 1 μl of [α-32P]dCTP,
after which both tubes are incubated at room temperature for
5 min, then at 42° C. for 1.5 hours. After 1.5 hours, 1 μl of
0.5M EDTA (pH 8.0) is added to the tube with the radio-
active label to quench the reaction. This sample is used to
estimate the amount of cDNA synthesized in the reaction. To
the main reaction, 4 μl of 0.5M EDTA (pH 8.0) and 200 μl
buffered phenol are added. After vortexing, the mixture is
microfuged at room temperature for 1 min to separate the
phases, after which the upper aqueous phase is transferred to
a new tube. To the phenol layer, add 100 μl TE buffer (pH
7.5), vortex, and microcentrifuge as described above.
Remove the aqueous layer and add it to the aqueous phase
from the first extraction. To the aqueous solution, add 1 ml
diethyl ether, vortex, and microcentrifuge as described
above, after which the upper (ether) layer is removed with
a glass pipet and discarded. Repeat the extraction with an
additional 1 ml diethyl ether. Add 125 μl of 7.5M ammonium
acetate to the aqueous phase (to give a final concentration of
about 2.0-2.5M) and 950 μl of 95% ethanol. Place in dry
ice/ethanol bath 15 min, warm to 4° C., and microcentrifuge
at 4° C. for 10 min at full speed to pellet the nucleic acids,
which may be visible as a small yellow-white pellet. After
removing the supernatant with a pipet, fill the tube with
ice-cold 70% ethanol, and microcentrifuge at 4° C. for 3 min
at full speed. Remove the supernatant and dry the tube
containing the precipitated DNA in a vacuum desiccator.
Resuspend the pellet from the first-strand synthesis in 284 μl
water and add to the tube the following components in the
following order to give a final volume of 400 μl: 4 μl 5 mM
dNTPs (50 μM final concentration each); 80 μl 5× second-
strand buffer (to give a 1× final concentration); 12 μl 5 mM
β-NAD+ (150 μM final concentration); and 2 μl 10 μCi/μl
[α-32P]dCTP (50 μCi/ml final) to monitor nucleotide incor-
poration. 5× second-strand buffer is 100 μl 1M Tris-Cl (pH
7.5), 500 μl 1M KCl, 25 μl 1M MgCl2, 50 μl 1M (NH4)2SO4,
50 μl 1M dithiothreitol, 50 μl 5 mg/ml bovine serum
albumin, and 225 μl H2O. After vortexing, briefly
microcentrifuge, then add the following: 4 μl (4 units)
RNase H (10 units/ml final concentration); 4 μl (20 units) E.
coli DNA ligase (50 units/ml final); and 10 μl (100 units) E.
coli DNA polymerase I (250 units/ml final). After vortexing
and briefly microcentrifuging, the mixture is incubated at
14° C. for 12 to 16 hours. After second strand synthesis is
complete, phenol extract the reaction mixture with 400 μl
buffered phenol and remove the aqueous phase. Back extract
the phenol phase with 200 μl TE (pH 7.5) as described
above. Pool the aqueous phases and extract twice with 900
μl ether, as described above, to give a final aqueous phase of
about 600 μl. Divide the aqueous phase evenly between two
tubes, add ammonium acetate, and ethanol precipitate, as
described above. Second strand synthesis is completed and
the ends of the cDNA blunted as follows: Resuspend the
pooled pellets in 42 μl water and add the following com-
ponents in the following order to give a final volume of 80
μl: 5 μl 5 mM dNTPs (310 μM final concentration each); 16
μl 5× TA buffer (1× final concentration); and 1 μl 5 mM
β-NAD+ (62 μM final concentration). 5× TA buffer is 200 μl
1M Tris-acetate (pH 7.8); 400 μl 1M potassium acetate, 60
μl 1M magnesium acetate, 3 μl 1M dithiothreitol, 105 μl 5
mg/ml bovine serum albumin, and 432 μl H2O. After vor-
texing and briefly microcentrifuging, the following are
added: 4 μl of 2 μg/ml RNase A (100 ng/ml final
concentration); 4 μl (4 units) RNase H (50 units/ml final); 4
μl (20 units) E. coli DNA ligase (250 units/ml final); and 4
μl (8 units) T4 DNA polymerase (100 units/ml final). The
mixture is vortexed, briefly microcentrifuged, and incubated
45 min at 37° C., after which 120 μl TE (pH 7.5) and 1 μl
of 10 mg/ml tRNA are added. The resulting mixture is
extracted with 200 μl buffered phenol. After removal of the
aqueous phase, the phenol phase is back extracted with 100
μl TE as described above. The two aqueous phases are
pooled and extracted twice with 1 ml ether, as described
above, after which the cDNA is ethanol precipitated as
described above.
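
The final concentrations quoted in the cDNA synthesis protocol follow from simple dilution arithmetic (stock concentration × volume added / final volume). As a hedged cross-check, the Python sketch below recomputes several of the stated values; the table layout and function name are illustrative, and a 5% tolerance is assumed to allow for the rounding used in the text (e.g. 312.5 μM stated as 310 μM).

```python
import math


def final_conc(stock_conc: float, added_ul: float, total_ul: float) -> float:
    """Concentration after diluting `added_ul` of stock into `total_ul`."""
    return stock_conc * added_ul / total_ul


# (label with units, stock concentration, volume added in µl,
#  final reaction volume in µl, final concentration stated in the text)
CHECKS = [
    ("first-strand dNTPs (uM)", 5000, 20, 200, 500),
    ("first-strand dithiothreitol (mM)", 200, 10, 200, 10),
    ("first-strand oligo(dT) (ug/ml)", 500, 20, 200, 50),
    ("second-strand dNTPs (uM)", 5000, 4, 400, 50),
    ("second-strand beta-NAD+ (uM)", 5000, 12, 400, 150),
    ("blunting dNTPs (uM)", 5000, 5, 80, 310),
    ("blunting beta-NAD+ (uM)", 5000, 1, 80, 62),
]


def verify(checks=CHECKS, rel_tol=0.05):
    """Return (label, computed, stated, agrees) for every entry."""
    rows = []
    for label, stock, added, total, stated in checks:
        computed = final_conc(stock, added, total)
        rows.append((label, computed, stated,
                     math.isclose(computed, stated, rel_tol=rel_tol)))
    return rows


if __name__ == "__main__":
    for label, computed, stated, ok in verify():
        print(f"{label}: computed {computed:g}, stated {stated} "
              f"[{'ok' if ok else 'CHECK'}]")
```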

Eco RI linkers (New England Biolabs, Beverly, Mass.) are
ligated to the ends of the cDNAs in a conventional ligation
reaction: cDNA from the above reaction is dissolved in 23 μl
water, after which the following components are added in
the following order: 3 μl 10× T4 DNA ligase buffer
(manufacturer's recommendation) containing 5 mM ATP (to
1× final buffer concentration and 0.5 mM final ATP
concentration), and 2 μl 1 μg/μl phosphorylated Eco RI
linkers (67 μg/ml final concentration) to give a final volume
of 30 μl. After gentle mixing, 2 μl (800 units) T4 DNA ligase
(New England Biolabs) is added (27,000 units/ml final) and
the mixture is incubated overnight at 4° C. After microcen-
trifuging briefly, the ligase is inactivated by heating the
reaction mixture to 65° C. for 10 min in a water bath, after
which the mixture is placed on ice for 2 min. To the reaction
mixture, the following components are added in the follow-
ing order: 95 μl H2O and 15 μl 10× Eco RI buffer (1× final
concentration). After gentle mixing, 10 μl (200 units) Eco RI
is added to give a final concentration of 1300 units/ml and
the mixture is incubated for 4 hours at 37° C. After such
incubation, an additional 3 μl (60 units) of Eco RI is added
to the mixture, after which it is gently mixed and incubated
another hour at 37° C. to ensure complete digestion of the
cDNA and linkers. The restriction fragments are separated
from the rest of the reaction mixture by CL-4B column
chromatography, e.g. as taught by Ausubel et al, unit 5.6
Current Protocols (cited above). Alternatively, fragments
may be purified by passing the reaction mixture through a
conventional spin column, such as a Chroma Spin-30 col-
umn (Clontech Laboratories, Palo Alto, Calif.), or the like.



As another alternative, ethidium-labeled fragments may be 
purified by agarose gel electrophoresis, followed by excision 
of the fragment-containing portion of the gel and dialysis. 
After purification, the fragments are ethanol precipitated. 

1 μg (0.57 pmol) of the above-modified pUC 19 plasmid
is digested with Eco RI in Eco RI buffer as recommended by
the manufacturer (New England Biolabs, Beverly, Mass.),
purified by phenol extraction and ethanol precipitation, and
ligated to a two-fold molar excess of fragments (about 200 ng) in
a conventional ligation reaction. A bacterial host is 
transformed, e.g. by electroporation, and plated so that hosts 
containing recombinant plasmids are identified by white 
colonies. 25,000 colonies are picked and expanded in liquid 
culture. 

Plasmid DNA is isolated by conventional alkaline lysis 
followed by anion-exchange purification using a Qiagen-tip 
20 plasmid purification kit (Santa Clarita, Calif.), or like kit. 
1 μg of purified plasmid DNA is digested to completion with
Bsg I using the manufacturer's protocol (New England 
Biolabs, Beverly, Mass.), and after phenol extraction, the 
vector-containing fragment is separated by agarose gel elec- 
trophoresis followed by isolation with a QIAquick Gel 
Extraction Kit (Qiagen, Inc., Santa Clarita, Calif.). The ends 
of the isolated fragment are then blunted by Mung bean 
nuclease (using the manufacturer's recommended protocol, 
New England Biolabs), after which the blunted fragments 
are purified by phenol extraction and ethanol precipitation. 
The fragments are then resuspended in a ligation buffer at a 
concentration of about 1 μg/ml in a 0.5 ml reaction volume.
The dilution is designed to promote self-ligation of the 
fragments, following the protocol of Dugaiczyk et al (cited 
above). After ligation and concentration by ethanol 
precipitation, the pairs of segments carried by the plasmids 
are amplified by PCR using primers p1 and p2. Preferably, p1
and p2 are selected to bind to regions of the vector 5' and 3'
of the polylinker site, respectively, so that amplification
results in an amplicon of about 110-150 basepairs. 18-mer
primers are employed with the 5' most nucleotide of p1
binding to a complementary nucleotide 64 bases upstream of
the Eco RI insertion site and the 5' most nucleotide of p2
binding to a complementary nucleotide 36 bases down-
stream of the Eco RI insertion site. In this manner, three
readily separable fragments are produced upon
digestion with w1 and w2. 15-20 amplification cycles are
carried out so that at least about a 1000-fold amplification is 
achieved. The amplified product is purified with a QIAquick 
PCR Purification Kit (Qiagen, Inc.), or like procedure, after 
which it is cleaved with Bbv I using the manufacturer's 
recommended protocol (New England Biolabs). After iso- 
lation by polyacrylamide gel electrophoresis and 
purification, the pairs are concatenated by carrying out a 
conventional ligation reaction. The concatenated fragments 
are separated by polyacrylamide gel electrophoresis and 
concatemers greater than about 200 basepairs are isolated 
and ligated into a Phagescript SK sequencing vector 
(Stratagene Cloning Systems, La Jolla, Calif.). Preferably, a 
number of clones are expanded and sequenced that ensure 
with a probability of at least 99% that all of the pairs of the 
aliquot are sequenced. A "lane" of sequence data (about 600 
bases) obtained with conventional sequencing provides the 
sequences of about 25 pairs of segments. Thus, after 
transfection, 1000 individual clones are expanded and
sequenced on a commercially available DNA sequencer, e.g. 
PE Applied Biosystems model 377, to give the identities of 
about 25,000 pairs of segments. 
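
The sequencing workload and the resulting sensitivity can be estimated with two small helper functions. The Python sketch below is illustrative (the names are hypothetical): it recomputes the number of sequencing reactions needed for a given number of pairs, and the probability that a sequence of a given frequency is seen at least once among the pairs actually sequenced, assuming independent sampling.

```python
import math


def reactions_needed(n_pairs: int, pairs_per_reaction: int) -> int:
    """Number of sequencing reactions required to read n_pairs pairs."""
    return math.ceil(n_pairs / pairs_per_reaction)


def detection_probability(n_pairs: int, freq: float) -> float:
    """Probability of seeing a sequence of frequency `freq` at least once
    among n_pairs independently sampled pairs."""
    return 1.0 - (1.0 - freq) ** n_pairs


if __name__ == "__main__":
    # 46,000 pairs at 20 pairs per reaction, as in the general description.
    print(reactions_needed(46_000, 20))
    # 1000 clones at about 25 pairs per lane, as in this example.
    n = 1000 * 25
    print(n, round(detection_probability(n, 1e-4), 3))
```

Under these assumptions, the 25,000 pairs of this example would detect a sequence present at 1 in 10,000 with a probability a little above 90%, somewhat below the 99% level that requires about 46,000 pairs.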

EXAMPLE 2 

Analysis of Human Pancreatic Cell Expression by 
Nla III Digestion of a cDNA Library Purified on 
Solid Phase Supports 

In this example, a cDNA library is constructed from
human pancreatic mRNA available commercially from
Clontech Laboratories (Palo Alto, Calif.). After first strand
synthesis using a 5'-biotinylated poly(dT) primer, second
strand synthesis is accomplished using random primers with
a conventional protocol. After Sph I linkers are ligated to the
cDNAs, they are affinity purified with avidinated magnetic
beads, digested to completion with four-base cutter, Nla III,
and the released fragments are purified and inserted into a
pUC 19 cloning vector modified as described below for
expansion and generation of pairs of sequence tags. The
pairs of sequence tags are excised from the vector, ligated to
form concatemers, cloned, and sequenced, as described in
Example 1.

The following insert is prepared for ligation into an Eco
RI-Hind III-digested pUC 19:



Formula III
(SEQ ID NO: 9)

                 Bbv I Bsg I
                   ↓     ↓
5'-AATTAGCCGTACCTGCAGCAGTGCAG-
       TCGGCATGGACGTCGTCACGTC-

  -GCATGCCTGCACAGCTGCGAATCATTCG
  -CGTACGGACGTGTCGACGCTTAGTAAGCTCGA
     ↑      ↑     ↑
   Sph I  Bsg I Bbv I

As above, after ligation, transformation of a suitable host, 
and expansion, the modified pUC19 is isolated and the insert 
is sequenced to confirm its identity. 

5 μg of mRNA is converted into biotinylated cDNA using
a conventional cDNA synthesis kit (Capture Clone Magnetic 
cDNA Synthesis and Ligation System, Promega Corp., 
Madison, Wis.), after which Sph I linkers (New England 
Biolabs, Beverly, Mass.) are ligated to the blunt ends of the 
cDNAs. The cDNAs are then affinity purified with avid- 
inated magnetic beads following the manufacturer's sug- 
gested protocol. The bead-cDNA conjugates are resus- 
pended in a cleavage buffer (NEBuffer 4 plus bovine serum 
albumin, New England Biolabs, Beverly, Mass.) for cleav- 
age with Nla III (New England Biolabs) following the 
manufacturer's protocol (approximately 4-5 units Nla III incubated for 1
hour at 37° C.). After separating the beads from the reaction
mixture, the released fragments are isolated by phenol 
extraction followed by ethanol precipitation. The fragments 
are then inserted into the above-modified Sph I-digested
pUC 19. The procedure of Example 1 is followed thereafter
so that concatemers of pairs are formed, cloned, and
sequenced as described.

EXAMPLE 3

Analysis of Yeast Gene Expression by Sau 3A and
Tsp 509 Digestion of a cDNA Library Followed by
Adaptor Ligation

After double stranded blunt-end cDNA is produced as
described in Example 1, it is digested to completion with
Sau 3A using the manufacturer's (New England Biolabs)
suggested protocol. The restriction fragments are removed
from the reaction mixture by phenol extraction and ethanol
precipitation, after which the precipitate is re-suspended in
NEBuffer No. 1. 10 units of Tsp 509 is added to give a 50
μl reaction volume which is incubated at 65° C. for 1 hour.
After phenol extraction and ethanol precipitation, the frag-
ments are resuspended in T4 DNA ligase buffer, as described
in Example 1. The adaptors of Formula I are added to the
reaction mixture in approximately 10-fold concentration
excess over that of the fragments. T4 DNA ligase is added
under conventional reaction conditions. After incubation,
the adaptors are separated from the fragments by a com-
mercially available anion-exchange column (Qiagen), and
the isolated fragments are then digested to completion with
Eco RI using the manufacturer's (New England Biolabs)
recommended protocol. After isolation by phenol extraction
and ethanol precipitation, the Eco RI fragments are inserted
into the Eco RI cloning site of the pZErO-1 vector
(Invitrogen, Carlsbad, Calif.) using the manufacturer's
instructions. After transformation and selection, isolated
vectors are treated to produce concatemers of pairs as
described above.

The foregoing disclosure of preferred embodiments of the 
invention has been presented for purposes of illustration and 
description. It is not intended to be exhaustive or to limit the 
invention to the precise form disclosed, and obviously many 
modifications and variations are possible in light of the 

40 above teaching. The embodiments were chosen and 
described in order to best explain the principles of the 
invention and its practical application, to thereby enable 
others skilled in the art to best utilize the invention in various 
embodiments and with various modifications as are suited to 

the particular use contemplated. It is intended that the scope
of the invention be defined by the claims appended hereto. 



SEQUENCE LISTING 



<160> NUMBER OF SEQ ID NOS : 9 

<210> SEQ ID NO 1 
<211> LENGTH: 22 
<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 
<220> FEATURE: 
<221> NAME/KEY: 
<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor 
<400> SEQUENCE: 1



ggctaggaat tcattcgtgc ag



<210> SEQ ID NO 2 

<211> LENGTH: 26 

<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 

<220> FEATURE : 

<221> NAME /KEY : 

<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor 

<400> SEQUENCE: 2 

aattctgcac gaatgaattc ctagcc 26 

<210> SEQ ID NO 3 
<211> LENGTH: 26 
<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 
<220> FEATURE: 
<221> NAME /KEY: 
<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor 
<400> SEQUENCE: 3 

gatcctgcac gaatgaattc ctagcc 26 



<210> SEQ ID NO 4 

<211> LENGTH: 27 

<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 

<220> FEATURE: 

<221> NAME /KEY: 

<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor 

<400> SEQUENCE: 4 

aattagccgt acctgcagca gtgcagg 27 



<210> SEQ ID NO 5 
<211> LENGTH: 27 
<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 
<220> FEATURE: 
<221> NAME /KEY: 
<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor 
<400> SEQUENCE: 5

aattcctgca cagctgcgaa tcattcg 27 



<210> SEQ ID NO 6 
<211> LENGTH: 22 
<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 
<220> FEATURE: 
<221> NAME /KEY: 
<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor 
<400> SEQUENCE: 6

agctcgaatg attcgcagct gt 22 



<210> SEQ ID NO 7 
<211> LENGTH: 32 
<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 
<220> FEATURE: 
<221> NAME/KEY: 
<222> LOCATION: 

<223> OTHER INFORMATION: Single strand of adaptor



<400> SEQUENCE : 7 

gcaggaattc ctgcactgct gcaggtacgg ct 32 

<210> SEQ ID NO 8 
<211> LENGTH: 54 
<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 
<220> FEATURE: 
<221> NAME / KEY : 
<222> LOCATION: 

<223> OTHER INFORMATION: Double stranded insert 
<400> SEQUENCE: 8 

aattagccgt acctgcagca gtgcaggaat tcctgcacag ctgcgaatca 50

ttcg 54 

<210> SEQ ID NO 9 

<211> LENGTH: 54 

<212> TYPE: DNA 

<213> ORGANISM: Artificial Sequence 

<220> FEATURE: 

<221> NAME /KEY: 

<222> LOCATION: 

<223> OTHER INFORMATION: Double stranded insert 

<400> SEQUENCE: 9 

aattagccgt acctgcagca gtgcaggcat gcctgcacag ctgcgaatca 50

ttcg 54 



I claim: 

1. A method of analyzing gene expression in a cell or
tissue, the method comprising the steps of

(a) forming a population of cDNA molecules from mRNA
of a cell or tissue;

(b) digesting the population of cDNA molecules with at
least one restriction endonuclease to produce a popu-
lation of polynucleotides having predetermined ends;

(c) enzymatically removing a segment of nucleotides
from each predetermined end of each polynucleotide
and ligating the segments from each end together to
form a pair of sequence tags for each polynucleotide,
wherein said segments are formed by inserting each of
said polynucleotides into a cloning site of a vector, the
cloning site being flanked by a first type IIs restriction
site and a second type IIs restriction site such that a type
IIs restriction endonuclease recognizing either said first
or second sites cleaves the vector within to said
polynucleotide, the first IIs restriction site and the
second type IIs restriction site being the same or
different and each of the first and second type IIs
restriction sites being unique to the vector;

(d) determining the nucleotide sequences of a sample of
pairs of sequence tags; and 

(e) tabulating the nucleotide sequences of the pairs of 
sequence tags to form a frequency distribution of gene 
expression in the cell or tissue. 

2. The method of claim 1 wherein said step of determining 
said nucleotide sequences includes the steps of ligating said
sample of pairs of sequence tags together to form one or 
more concatenations of pairs of sequence tags and sequenc- 
ing the concatenations of pairs of sequence tags. 

3. The method of claim 1 wherein said at least one 
restriction endonuclease is a four-cutter restriction endonu-
clease which leaves a four-nucleotide protruding strand after
cleavage. 



4. The method of claim 1 wherein said step of enzymati- 
cally removing further includes cleaving said vector with 
one or more nucleases recognizing said first IIs restriction
site and said second type IIs restriction site to form a
linearized vector having said segments of nucleotides at 
each end. 

5. The method of claim 4 wherein said step of enzymati- 
cally removing further includes re-circularizing said linear- 
ized vector to form said pair of sequence tags. 

6. A method of determining sequence frequencies in a 
population of polynucleotides, the method comprising the 
steps of: 

(a) providing a population of polynucleotides having 
predetermined ends; 

(b) inserting each polynucleotide of the population into a 
vector, the vector having at least one type IIs restriction
endonuclease recognition site adjacent to each end of
the inserted polynucleotide, each type IIs restriction
endonuclease recognition site being oriented such that
a type IIs restriction endonuclease recognizing said
sites cleaves the vector within to the inserted poly-
nucleotide;

(c) cleaving each vector with one or more type IIs
restriction endonucleases recognizing the type IIs
restriction endonuclease recognition sites so that the
vector is linearized and has a sequence tag of the 
inserted polynucleotide at each end; 

(d) re-circularizing the vector to form a pair of sequence 
tags for the inserted polynucleotide; and 

(e) determining the nucleotide sequence of each pair of 
sequence tags of a sample of re-circularized vectors to 
give the sequence frequencies of the population of 
polynucleotides.

7. The method of claim 6 further including the step of
tabulating the pairs of nucleotide sequences of said sequence 
tags of said re-circularized vectors of said step (e) to form a 
frequency distribution of sequences in the population of 
polynucleotides. 

8. The method of claim 7 wherein said step of determining 
said nucleotide sequence of each of said pairs of said
sequence tags includes the steps of removing said pairs of
said sequence tags from said re-circularized vectors of said
sample, ligating the removed pairs of said sequence tags to 
form one or more concatenations of pairs, and sequencing 
the concatenations of pairs.

ABSTRACT



The invention provides a method for constructing a high 
resolution physical map of a polynucleotide. In accordance 
with the invention, nucleotide sequences are determined at 
the ends of restriction fragments produced by a plurality of 
digestions with a plurality of combinations of restriction 
endonucleases so that a pair of nucleotide sequences is 
obtained for each restriction fragment. A physical map of the 
polynucleotide is constructed by ordering the pairs of 
sequences by matching the identical sequences among the 
pairs. 

6 Claims, 4 Drawing Sheets

DNA RESTRICTION SITE MAPPING

FIELD OF THE INVENTION 

The invention relates generally to methods for construct-
ing physical maps of DNA, especially genomic DNA, and
more particularly, to a method of providing high resolution
physical maps by sequence analysis of concatenations of 
segments of restriction fragment ends. 

BACKGROUND 

Physical maps of one or more large pieces of DNA, such 
as a genome or chromosome, consist of an ordered collec- 
tion of molecular landmarks that may be used to position, or 
map, a smaller fragment, such as a clone containing a gene of
interest, within the larger structure, e.g. U.S. Department of 
Energy, "Primer on Molecular Genetics," from Human 
Genome 1991-92 Program Report; and Los Alamos 
Science, 20: 112-122 (1992). An important goal of the 
Human Genome Project has been to provide a series of 
genetic and physical maps of the human genome with 
increasing resolution, i.e. with reduced distances in base- 
pairs between molecular landmarks, e.g. Murray et al, 
Science, 265: 2049-2054 (1994); Hudson et al, Science, 
270: 1945-1954 (1995); Schuler et al, Science, 274:
540-546 (1996); and so on. Such maps have great value not 
only in furthering our understanding of genome 
organization, but also as tools for helping to fill contig gaps 
in large-scale sequencing projects and as tools for helping to 
isolate disease-related genes in positional cloning projects,
e.g. Rowen et al, pages 167-174, in Adams et al, editors, 
Automated DNA Sequencing and Analysis (Academic 
Press, New York, 1994); Collins, Nature Genetics, 9: 
347-350 (1995); Rossiter and Caskey, Annals of Surgical 
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Oncology, 2: 14-25 (1995); and Schuler et al (cited above).
In both cases, the ability to rapidly construct high-resolution
physical maps of large pieces of genomic DNA is highly
desirable.

Two important approaches to genomic mapping include
the identification and use of sequence tagged sites (STS's),
e.g. Olson et al, Science, 245: 1434-1435 (1989); and Green
et al, PCR Methods and Applications, 1: 77-90 (1991), and
the construction and use of jumping and linking libraries,
e.g. Collins et al, Proc. Natl. Acad. Sci., 81: 6812-6816
(1984); and Poustka and Lehrach, Trends in Genetics, 2:
174-179 (1986). The former approach makes maps highly
portable and convenient, as maps consist of ordered collec-
tions of nucleotide sequences that allow application without
having to acquire scarce or specialized reagents and librar-
ies. The latter approach provides a systematic means for
identifying molecular landmarks spanning large genetic
distances and for ordering such landmarks via hybridization
assays with members of a linking library.

Unfortunately, these approaches to mapping genomic
DNA are difficult and laborious to implement. It would be
highly desirable if there were an approach for constructing
physical maps that combined the systematic quality of the
jumping and linking libraries with the convenience and
portability of the STS approach.

SUMMARY OF THE INVENTION

Accordingly, an object of my invention is to provide
methods and materials for constructing high resolution
physical maps of genomic DNA.

Another object of my invention is to provide a method of
ordering restriction fragments from multiple enzyme digests
by aligning matching sequences of their ends.

Still another object of my invention is to provide a high
resolution physical map of a target polynucleotide that
permits directed sequencing of the target polynucleotide
with the sequences of the map.

Another object of my invention is to provide vectors for
excising ends of restriction fragments for concatenation and
sequencing.

Still another object of my invention is to provide a method
of monitoring the expression of genes.

A further object of my invention is to provide physical
maps of genomic DNA that consist of an ordered collection
of nucleotide sequences spaced at an average distance of a
few hundred to a few thousand bases.

My invention achieves these and other objects by provid-
ing methods and materials for determining the nucleotide
sequences of both ends of restriction fragments obtained
from multiple enzymatic digests of a target polynucleotide,
such as a fragment of a genome, or chromosome, or an insert
of a cosmid, BAC, YAC, or the like. In accordance with the
invention, a polynucleotide is separately digested with dif-
ferent combinations of restriction endonucleases and the
ends of the restriction fragments are sequenced so that pairs
of sequences from each fragment are produced. A physical
map of the polynucleotide is constructed by ordering the
pairs of sequences by matching the identical sequences
among such pairs resulting from all of the digestions.

In the preferred embodiment, a polynucleotide is mapped
by the following steps: (a) providing a plurality of popula-
tions of restriction fragments, the restriction fragments of
each population having ends defined by digesting the poly-
nucleotide with a plurality of combinations of restriction
endonucleases; (b) determining the nucleotide sequence of a
portion of each end of each restriction fragment of each
population so that a pair of nucleotide sequences is obtained
for each restriction fragment of each population; and (c)
ordering the pairs of nucleotide sequences by matching the
nucleotide sequences between pairs to form a map of the
polynucleotide.

Another aspect of the invention is the monitoring of gene
expression by providing pairs of segments excised from
cDNAs. In this embodiment, segments from each end of
each cDNA of a population of cDNAs are ligated together to
form pairs, which serve to identify their associated cDNAs.
Concatenations of such pairs are sequenced by conventional
techniques to provide information on the relative frequen-
cies of expression in the population.

The invention provides a means for generating a high
density physical map of target polynucleotides based on the
positions of the restriction sites of predetermined restriction
endonucleases. Such physical maps provide many
advantages, including a more efficient means for directed
sequencing of large DNA fragments, the positioning of
expression sequence tags and cDNA sequences on large
genomic fragments, such as BAC library inserts, thereby
making positional candidate mapping easier; and the like.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 graphically illustrates the concept of a preferred
embodiment of the invention.

FIG. 2 provides a diagram of a vector for forming pairs of
nucleotide sequences in accordance with a preferred
embodiment of the invention.

FIG. 3 illustrates a scheme for carrying out the steps of a
preferred embodiment of the invention.

FIG. 4 illustrates locations on yeast chromosome 1 where
sequence information is provided in a physical map based on
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digestions with Hind III, Eco RI, and Xba I in accordance
with the invention.

DEFINITIONS

As used herein, the process of "mapping" a polynucle-
otide means providing an ordering, or series, of sequenced
segments of the polynucleotide that correspond to the actual
ordering of the segments in the polynucleotide. For example,
the following set of five-base sequences is a map of the
polynucleotide below (SEQ ID NO: 1), which has the
ordered set of sequences making up the map underlined:

(gggtc, ttatt, aacct, catta, ccgga)

GTTGGGTCAACAAAITACCT^ ^
CATTA G CCGGA GCCT

The term "oligonucleotide" as used herein includes linear
oligomers of natural or modified monomers or linkages,
including deoxyribonucleosides, ribonucleosides, and the
like, capable of specifically binding to a target polynucle-
otide by way of a regular pattern of monomer-to-monomer
interactions, such as Watson-Crick type of base pairing, base
stacking, Hoogsteen or reverse Hoogsteen types of base
pairing, or the like. Usually monomers are linked by phos-
phodiester bonds or analogs thereof to form oligonucleotides
ranging in size from a few monomeric units, e.g. 3-4, to
several tens of monomeric units, e.g. 40-60. Whenever an
oligonucleotide is represented by a sequence of letters, such
as "ATGCCTG," it will be understood that the nucleotides
are in 5'→3' order from left to right and that "A" denotes
deoxyadenosine, "C" denotes deoxycytidine, "G" denotes
deoxyguanosine, and "T" denotes thymidine, unless other-
wise noted. Usually oligonucleotides comprise the four
natural nucleotides; however, they may also comprise non-
natural nucleotide analogs. It is clear to those skilled in the
art when oligonucleotides having natural or non-natural
nucleotides may be employed, e.g. where processing by
enzymes is called for, usually oligonucleotides consisting of
natural nucleotides are required.

"Perfectly matched" in reference to a duplex means that
the poly- or oligonucleotide strands making up the duplex
form a double stranded structure with one other such that
every nucleotide in each strand undergoes Watson-Crick
basepairing with a nucleotide in the other strand. The term
also comprehends the pairing of nucleoside analogs, such as
deoxyinosine, nucleosides with 2-aminopurine bases, and
the like, that may be employed. In reference to a triplex, the
term means that the triplex consists of a perfectly matched
duplex and a third strand in which every nucleotide under-
goes Hoogsteen or reverse Hoogsteen association with a
basepair of the perfectly matched duplex.

As used herein, "nucleoside" includes the natural
nucleosides, including 2'-deoxy and 2'-hydroxyl forms, e.g.
as described in Kornberg and Baker, DNA Replication, 2nd
Ed. (Freeman, San Francisco, 1992). "Analogs" in reference
to nucleosides includes synthetic nucleosides having modi-
fied base moieties and/or modified sugar moieties, e.g.
described by Scheit, Nucleotide Analogs (John Wiley, New
York, 1980); Uhlman and Peyman, Chemical Reviews, 90:
543-584 (1990), or the like, with the only proviso that they
are capable of specific hybridization. Such analogs include
synthetic nucleosides designed to enhance binding
properties, reduce complexity, increase specificity, and the
like.

As used herein, the term "complexity" in reference to a
population of polynucleotides means the number of different
species of polynucleotide present in the population.

DETAILED DESCRIPTION OF THE
INVENTION

In accordance with the present invention, segments of
nucleotides at each end of restriction fragments produced
from multiple digestions of a polynucleotide are sequenced
and used to arrange the fragments into a physical map. Such
a physical map consists of an ordered collection of the
nucleotide sequences of the segments immediately adjacent
to the cleavage sites of the endonucleases used in the
digestions. Preferably, after each digestion, segments are
removed from the ends of each restriction fragment by
cleavage with a type IIs restriction endonuclease. Excised
segments from the same fragment are ligated together to
form a pair of segments. Preferably, collections of such pairs
are concatenated by ligation, cloned, and sequenced using
conventional techniques.

The concept of the invention is illustrated in FIG. 1 for an
embodiment which employs three restriction endonucleases:
r, q, and s. Polynucleotide (50) has recognition sites (r1, r2,
and r3) for restriction endonuclease r, recognition sites
for restriction endonuclease q, and recognition sites
(s1 through s5) for restriction endonuclease s. In
accordance with the preferred embodiment, polynucleotide
(50) is separately digested with r and s, q and s, and r and
q to produce three populations of restriction fragments (58),
(60), and (62), respectively. Segments adjacent to the ends
of each restriction fragment are sequenced to form sets of
pairs (52), (54), and (56) of nucleotide sequences, which for
sake of illustration are shown directly beneath their corre-
sponding restriction fragments in the correct order. Pairs of
sequences from all three sets are ordered by matching
sequences between pairs as shown (70). A nucleotide
sequence (72) from a first pair is matched with a sequence
(74) of a second pair whose other sequence (76) is
matched with a sequence (78) of a third pair. The matching
continues, as (80) is matched with (82), (84) with (86), (88)
with (90), and so on, until the maximum number of pairs are
included. It is noted that some pairs (92) do not contribute
to the map. These correspond to fragments having the same
restriction site at both ends. In other words, they correspond
to situations where there are two (or more) consecutive
restriction sites of the same type without other sites in
between, e.g. s3 and s4 in this example. Preferably, algo-
rithms used for assembling a physical map from the pairs of
sequences can eliminate pairs having identical sequences.

Generally, a plurality of enzymes is employed in each
digestion. Preferably, at least three distinct recognition sites
are used. This can be accomplished by using three or more
restriction endonucleases, such as Hind III, Eco RI, and Xba
I, which recognize different nucleotide sequences, or by
using restriction endonucleases recognizing the same nucle-
otide sequence, but which have different methylation sen-
sitivities. That is, it is understood that a different recogni-
tion site may be different solely by virtue of a different
methylation state. Preferably a set of at least three recog-
nition endonucleases is employed in the method of the
invention. From this set a plurality of combinations of
restriction endonucleases is formed for separate digestion of
a target polynucleotide. Preferably, the combinations are
"n-1" combinations of the set. In other words, for a set of n
restriction endonucleases, the preferred combinations are all
the combinations of n-1 restriction endonucleases. For
example, as illustrated in FIG. 1 where a set of three
restriction endonucleases (r, q, and s) are employed, the n-1
combinations are (r, q), (r, s), and (q, s). Likewise, if four
restriction endonucleases (r, q, s, and w) are employed the
n-1 combinations are (r, q, s), (r, q, w), (r, s, w), and (q, s,
w). It is readily seen that where a set of n restriction
endonucleases are employed the plurality of n-1 combina-
tions is n.

Preferably, the method of the invention is carried out
using a vector, such as that illustrated in FIG. 2. The vector
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is readily constructed from commercially available materials
using conventional recombinant DNA techniques, e.g. as
disclosed in Sambrook et al, Molecular Cloning, Second
Edition (Cold Spring Harbor Laboratory, New York, 1989).
Preferably, pUC-based plasmids, such as pUC19, or λ-based
phages, such as λ ZAP Express (Stratagene Cloning
Systems, La Jolla, Calif.), or like vectors are employed.
Important features of the vector are recognition sites (204)
and (212) for two type IIs restriction endonucleases that
flank restriction fragment (208). For convenience, the two
type IIs restriction enzymes are referred to herein as "IIs1"
and "IIs2", respectively. IIs1 and IIs2 may be the same or
different. Recognition sites (204) and (212) are oriented so
that the cleavage sites of IIs1 and IIs2 are located in the
interior of restriction fragment (208). In other words, taking
the 5' direction as "upstream" and the 3' direction as
"downstream," the cleavage site of IIs1 is downstream of its
recognition site and the cleavage site of IIs2 is upstream of
its recognition site. Thus, when the vector is cleaved with
IIs1 and IIs2 two segments (218) and (220) of restriction
fragment (208) remain attached to the vector. The vector is
then re-circularized by ligating the two ends together,
thereby forming a pair of segments. If such cleavage results
in one or more single stranded overhangs, i.e. one or more
non-blunt ends, then the ends are preferably rendered blunt
prior to re-circularization, for example, by digesting the
protruding strand with a nuclease such as Mung bean
nuclease, or by extending a 3' recessed strand, if one is
produced in the digestion. The ligation reaction for
re-circularization is carried out under conditions that favor
the formation of covalent circles rather than concatemers of
the vector. Preferably, the vector concentration for the
ligation is between about 0.4 and about 4.0 µg/ml of vector
DNA, e.g. as disclosed in Collins et al, Proc. Natl. Acad.
Sci., 81: 6812-6816 (1984), for λ-based vectors. For vectors
of different molecular weight, the concentration range is
adjusted appropriately.

In the preferred embodiments, the number of nucleotides
identified depends on the "reach" of the type IIs restriction
endonucleases employed. "Reach" is the amount of separa-
tion between a recognition site of a type IIs restriction
endonuclease and its cleavage site, e.g. Brenner, U.S. Pat.
No. 5,559,675. The conventional measure of reach is given
as a ratio of integers, such as "(16/14)", where the numerator
is the number of nucleotides from the recognition site in the
5'→3' direction that cleavage of one strand occurs and the
denominator is the number of nucleotides from the recog-
nition site in the 3'→5' direction that cleavage of the other
strand occurs. Preferred type IIs restriction endonucleases
for use as IIs1 and IIs2 in the preferred embodiment include
the following: Bbv I, Bce 83 I, Bcef I, Bpm I, Bsg I, BspLU
11 III, Bst 71 I, Eco 57 I, Fok I, Gsu I, Hga I, Mme I, and
the like. In the preferred embodiment, a vector is selected
which does not contain a recognition site, other than (204)
and (212), for the type IIs enzyme(s) used to generate pairs
of segments; otherwise, re-circularization cannot be carried
out.

Preferably, a type IIs restriction endonuclease for gener-
ating pairs of segments has as great a reach as possible to
maximize the probability that the nucleotide sequences of
the segments are unique. This in turn maximizes the prob-
ability that a unique physical map can be assembled. If the
target polynucleotide is a bacterial genome of 1 megabase,
for a restriction endonuclease with a six basepair recognition
site, about 250 fragments are generated (or about 500 ends)
and the number of nucleotides determined could be as low
as five or six, and still have a significant probability that each
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end sequence would be unique. Preferably, for polynucle-
otides less than or equal to 10 megabases, at least 8
nucleotides are determined in the regions adjacent to restric-
tion sites, when a restriction endonuclease having a six
basepair recognition site is employed. Generally for poly-
nucleotides less than or equal to 10 megabases, 9-12 nucle-
otides are preferably determined to ensure that the end
sequences are unique. In the preferred embodiment, type IIs
enzymes having a (16/14) reach effectively provide 9 bases
of unique sequence (since blunting reduces the number of
bases to 14 and 5 bases are part of the recognition sites (206)
or (210)). In a polynucleotide having a random sequence of
nucleotides, a 9-mer appears on average about once every
262,000 bases. Thus, 9-mer sequences are quite suitable for
uniquely labeling restriction fragments of a target polynucle-
otide corresponding to a typical yeast artificial chromosome
(YAC) insert, i.e. 100-1000 kilobases, bacterial artificial
chromosome (BAC) insert, i.e. 50-250 kilobases, and the
like.
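
The arithmetic in the preceding paragraphs can be checked with a short sketch. It is only an illustration in Python: the function names are hypothetical, the sequence is assumed to be random with equal base frequencies, and blunting is assumed to trim a (16/14) cut back to the shorter of the two strands.

```python
def informative_bases(reach_5to3, reach_3to5, site_bases_in_reach):
    """Bases of fragment-specific sequence captured by a type IIs cut.

    Assumption: blunting trims the overhang back to the shorter strand,
    and some of the remaining bases belong to the cloning site.
    """
    blunt_length = min(reach_5to3, reach_3to5)
    return blunt_length - site_bases_in_reach


def expected_spacing(k):
    """Average distance between occurrences of a given k-mer in random DNA."""
    return 4 ** k


def expected_fragments(target_bp, site_length=6):
    """Expected number of fragments from one enzyme with the given site length."""
    return target_bp / 4 ** site_length


# (16/14) reach, 5 bases taken up by the restriction site
print(informative_bases(16, 14, 5))
# spacing of a 9-mer, the "about 262,000 bases" quoted in the text
print(expected_spacing(9))
# a six-cutter on a 1 megabase target, the "about 250 fragments" in the text
print(round(expected_fragments(1_000_000)))
```

The same functions can be called with other reaches or recognition site lengths to see how quickly the tag length and the fragment count change.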

Immediately adjacent to IIs sites (204) and (212) are
restriction sites (206) and (210), respectively, that permit
restriction fragment (208) to be inserted into the vector. That
is, restriction site (206) is immediately downstream of (204)
and (210) is immediately upstream of (212). Preferably, sites
(204) and (206) are as close together as possible, even
overlapping, provided type IIs site (206) is not destroyed
upon cleavage with the enzymes for inserting restriction
fragment (208). This is desirable because the recognition site
of the restriction endonuclease used for generating the
fragments occurs between the recognition site and cleavage
site of type IIs enzyme used to remove a segment for
sequencing, i.e. it occurs within the "reach" of the type IIs
enzyme. Thus, the closer the recognition sites, the larger the
piece of unique sequence that can be removed from the
fragment. The same of course holds for restriction sites (210)
and (212). Preferably, whenever the vector employed is based
on a pUC plasmid, restriction sites (206) and (210) are selected
from either the restriction sites of polylinker region of the
pUC plasmid or from the set of sites which do not appear in
the pUC. Such sites include Eco RI, Apo I, Ban II, Sac I,
Kpn I, Acc65 I, Ava I, Xma I, Sma I, Bam HI, Xba I, Sal I,
Hinc II, Acc I, BspMI, Pst I, Sse8387 I, Sph I, Hind III, Afl
II, Age I, Bsp120 I, Asc I, Bbs I, Bcl I, Bgl II, Blp I, BsaA
I, Bsa BI, Bse RI, Bsm I, Cla I, Bsp EI, BssH II, Bst BI,
BstXI, Dra III, Eag I, Eco RV, Fse I, Hpa I, Mfe I, Nae I, Nco
I, Nhe I, Not I, Nru I, Pac I, Xho I, Pme I, Sac II, Spe I, Stu
I, and the like. Preferably, six-nucleotide recognition sites
(i.e. "6-cutters") are used, and more preferably, 6-cutters
leaving four-nucleotide protruding strands are used.

Preferably, the vectors contain primer binding sites (200)
and (216) for primers p1 and p2, respectively, which may be
used to amplify the pair of segments by PCR after
re-circularization. Recognition sites (202) and (214) are for
restriction endonucleases w1 and w2, which are used to
cleave the pair of segments from the vector after amplifi-
cation. Preferably, w1 and w2, which may be the same or
different, are type IIs restriction endonucleases whose cleav-
age sites correspond to those of (206) and (210), thereby
removing surplus, or non-informative, sequence (such as the
recognition sites (204) and (212)) and generating protruding
ends that permit concatenation of the pairs of segments.

FIG. 3 illustrates steps in a preferred method using vectors
of FIG. 2. Genomic or other DNA (400) is obtained using
conventional techniques, e.g. Herrmann and Frischauf,
Methods in Enzymology, 152: 180-183 (1987); Frischauf,
Methods in Enzymology, 152: 183-199 (1987), or the like,
after which it is divided (302) into aliquots that are
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separately digested (310) with combinations of restriction
endonucleases, as shown in FIG. 3 for the n-1 combinations
of the set of enzymes r, s, and q. Preferably, the resulting
fragments are treated with a phosphatase to prevent ligation
of the genomic fragments with one another before or during
insertion into a vector. Restriction fragments are inserted
(312) into vectors designed with cloning sites to specifically
accept the fragments. That is, fragments digested with r and
s are inserted into a vector that accepts r-s fragments.
Fragments having the same ends, e.g. r-r and s-s, are not
cloned since information derived from them does not con-
tribute to the map. r-s fragments are of course inserted into
the vector in both orientations. Thus, for a set of three
restriction endonucleases, only three vectors are required,
e.g. one each for accepting r-s, r-q, and s-q fragments.
Likewise, for a set of four restriction endonucleases, e.g. r,
s, q, and t, only six vectors are required, one each for
accepting r-s, r-q, r-t, s-q, s-t, and q-t fragments.
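
The counting in this paragraph can be illustrated with a small Python sketch. The function names are hypothetical, and the enzymes are represented only by the single-letter labels used in the text.

```python
from itertools import combinations


def digestion_combinations(enzymes):
    """All "n-1" combinations of a set of n restriction endonucleases."""
    return list(combinations(enzymes, len(enzymes) - 1))


def vector_end_types(enzymes):
    """Pairs of distinct fragment ends, one cloning vector per pair."""
    return list(combinations(enzymes, 2))


three = ["r", "s", "q"]
four = ["r", "s", "q", "t"]

print(digestion_combinations(three))   # three separate digestions
print(vector_end_types(three))         # three vectors: r-s, r-q, s-q
print(len(digestion_combinations(four)))  # four separate digestions
print(vector_end_types(four))          # six vectors
```

For n enzymes this gives n digestions and n(n-1)/2 vector types, which agrees with the three and six vectors stated above for sets of three and four enzymes.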

After insertion, a suitable host is transformed with the
vectors and cultured, i.e. expanded (314), using conven-
tional techniques. Transformed host cells are then selected,
e.g. by plating and picking colonies using a standard marker,
e.g. β-galactosidase/X-gal. A large enough sample of trans-
formed host cells is taken to ensure that every restriction
fragment is present for analysis with a reasonably large
probability. This is similar to the problem of ensuring
representation of a clone of a rare mRNA in a cDNA library,
as discussed in Sambrook et al, Molecular Cloning, Second
Edition (Cold Spring Harbor Laboratory, New York, 1989),
and like references. Briefly, the number of fragments, N, that
must be in a sample to achieve a given probability, P, of
including a given fragment is the following:
N=ln(1-P)/ln(1-f), where f is the frequency of the fragment
in the population. Thus, for a population of 500 restriction
fragments, a sample containing 3454 vectors will include at
least one copy of each fragment (i.e. a complete set) with a
probability of 99.9%; and a sample containing 2300 vectors
will include at least one copy of each fragment with a
probability of 99%. The table below provides the results of
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similar calculations for target polynucleotides of different
sizes:

TABLE I

Size of Target    Average fragment size       Average fragment size
Polynucleotide    after cleavage with         after cleavage with
(basepairs)       2 six-cutters               3 six-cutters
                  (No. of fragments)          (No. of fragments)
                  [Sample size for complete   [Sample size for complete
                  set with 99% probability]   set with 99% probability]

2.5 x 10^5        2048 (124) [576]            1365 (250) [1050]
5 x 10^5          2048 (250) [1050]           1365 (500) [2300]
1 x 10^6          2048 (500) [2300]           1365 (1000) [4605]

After selection, the vector-containing hosts are combined
and expanded in culture. The vectors are then isolated, e.g.
by a conventional mini-prep, or the like, and cleaved with
IIs1 and IIs2 (316). The fragments comprising the vector and
ends (i.e. segments) of the restriction fragment insert are
isolated, e.g. by gel electrophoresis, blunted (316), and
re-circularized (320). The resulting pairs of segments in the
re-circularized vectors are then amplified (322), e.g. by
polymerase chain reaction (PCR), after which the amplified
pairs are cleaved with w (324) to free the pairs of segments,
which are isolated (326), e.g. by gel electrophoresis. The
isolated pairs are concatenated (328) in a conventional
ligation reaction to produce concatemers of various sizes,
which are separated, e.g. by gel electrophoresis. Concatem-
ers greater than about 200-300 basepairs are isolated and
cloned (330) into a standard sequencing vector, such as
M13. The sequences of the cloned concatenated pairs are
analyzed on a conventional DNA sequencer, such as a model
377 DNA sequencer from Perkin-Elmer Applied Biosystems
Division (Foster City, Calif.).

In the above embodiment, the sequences of the pairs of
segments are readily identified between sequences for the
recognition site of the enzymes used in the digestions. For
example, when pairs are concatenated from fragments of the
r and s digestion after cleavage with a type IIs restriction
endonuclease of reach (16/14), the following pattern is
observed (SEQ ID NO: 1):

NNNNrrrrrrNN
NNNNNNNNNNNNNNNNNqqqqqqNNNNNN . . .
where "r" and "q" represent the nucleotides of the recogni-
tion sites of restriction endonuclease r and q, respectively,
and where the N's are the nucleotides of the pairs of
segments. Thus, the pairs are recognized by their length and
their spacing between known recognition sites.

Pairs of segments are ordered by matching the sequences
of segments between pairs. That is, a candidate map is built
by selecting pairs that have one identical and one different
sequence. The identical sequences are matched to form a
candidate map, or ordering, as illustrated below for pairs
(s1, s2), (s3, s2), (s3, s4), (s5, s4), and (s5, s6), where the s's
represent the nucleotide sequences of the segments:

s1--s2
    s2--s3
        s3--s4
            s4--s5
                s5--s6

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Sequence matching and candidate map construction is
readily carried out by computer algorithms, such as the
Fortran code provided in Appendix A. Preferably, a map
construction algorithm initially sorts the pairs to remove
identical pairs prior to map construction. That is, preferably
only one pair of each kind is used in the reconstruction. If
for two pairs, (si, sj) and (sm, sn), si=sm and sj=sn, then one
of the two can be eliminated prior to map construction. As
pointed out above, such additional pairs either correspond to
restriction fragments such as (92) of FIG. 1 (no sites of a
second or third restriction endonuclease in its interior) or
they are additional copies of pairs (because of sampling) that
can be used in the analysis. Preferably, an algorithm selects
the largest candidate map as a solution, i.e. the candidate
map that uses the maximal number of pairs.

The vector of FIG. 2 can also be used for determining the
frequency of expression of particular cDNAs in a cDNA
library. Preferably, cDNAs whose frequencies are to be
determined are cloned into a vector by way of flanking
restriction sites that correspond to those of (206) and (210).
Thus, cDNAs may be cleaved from the library vectors and
directionally inserted into the vector of FIG. 2. After
insertion, analysis is carried out as described for the map-
ping embodiment, except that a larger number of concate-
mers are sequenced in order to obtain a large enough sample
of cDNAs for reliable data on frequencies.
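
As a rough illustration of the frequency analysis, the sketch below tabulates sequenced tag pairs and applies the sampling formula N=ln(1-P)/ln(1-f) given earlier for a single species of frequency f. It is written in Python with hypothetical function names; the tag pairs in the usage lines are placeholder strings, not data from the invention.

```python
import math
from collections import Counter


def tag_frequencies(pairs):
    """Relative frequency of each tag pair among the sequenced pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def sample_size(p, f):
    """Smallest N with probability at least p of seeing a species of frequency f."""
    return math.ceil(math.log(1 - p) / math.log(1 - f))


# Placeholder tag pairs read from concatemers (assumed already parsed)
reads = [("acgtacgta", "ttgcaatgc")] * 3 + [("ggatccaaa", "cctagggtt")]
print(tag_frequencies(reads))

# Tags to sample so that a species at 1 in 500 is seen with 99% probability
print(sample_size(0.99, 1 / 500))
```

Evaluated exactly, the formula gives values within a few units of the 2300 and 3454 quoted earlier for 500 fragments; those quoted figures are what the approximation ln(1-f) ≈ -f produces.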

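For the mapping embodiment, the ordering of pairs by matching identical end sequences can be sketched as a walk along a chain. The Python below is a minimal sketch and not the Fortran code of Appendix A: the function name is hypothetical, it assumes every end sequence is unique, and it handles only a single unbranched chain rather than selecting the largest of several candidate maps.

```python
from collections import defaultdict


def order_pairs(pairs):
    """Order end-sequence pairs into one linear candidate map.

    Each distinct end sequence is treated as a node and each pair as an
    edge; the chain is then walked from one end to the other.
    """
    # Keep one copy of each pair and drop pairs whose two ends are identical.
    edges = {frozenset(p) for p in pairs if p[0] != p[1]}

    neighbours = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Start from a sequence that is matched only once (a chain end).
    ends = sorted(s for s, nb in neighbours.items() if len(nb) == 1)
    if not ends:
        return []

    path, previous, current = [ends[0]], None, ends[0]
    while True:
        onward = [s for s in neighbours[current] if s != previous]
        if len(onward) != 1:  # end of the chain, or an ambiguous branch
            break
        previous, current = current, onward[0]
        path.append(current)
    return path


pairs = [("s1", "s2"), ("s3", "s2"), ("s3", "s4"),
         ("s5", "s4"), ("s5", "s6"), ("s3", "s2")]
print(order_pairs(pairs))
```

With the five pairs used in the illustration above (plus one duplicate), the walk recovers the order s1 through s6. Repeated end sequences would create branches, at which point this sketch simply stops.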
EXAMPLE 1 






Constructing a Physical Map of Yeast Chromosome 
1 with Hind III, Eco RI, and Xba I 

In this example, a physical map of the 230 kilobase yeast 
chromosome 1 is constructed using pUC19 plasmids modi- 
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fied in accordance with FIG. 2. The chromosome is sepa-
rately digested to completion with the following combina-
tions of enzymes: Hind III and Eco RI, Hind III and Xba I,
and Eco RI and Xba I to generate three populations of
restriction fragments. Fragments from each population are
inserted into separate pUC19 plasmids, one for each restric-
tion fragment having different ends. That is, restriction
fragments from the Hind III-Eco RI digestion are present in
three types, ones with a Hind III-digested end and an Eco
RI-digested end ("H-E" fragments), ones with only Hind
III-digested ends ("H-H" fragments), and ones with only
Eco RI-digested ends ("E-E" fragments). Likewise,
restriction fragments from the Hind III-Xba I digestion are
present in three types, ones with a Hind III-digested end and
an Xba I-digested end ("H-X" fragments), ones with only
Hind III-digested ends ("H-H" fragments), and ones with
only Xba I-digested ends ("X-X" fragments). Finally,
restriction fragments from the Xba I-Eco RI digestion are
present in three types, ones with an Xba I-digested end and an
Eco RI-digested end ("X-E" fragments), ones with only Xba
I-digested ends ("X-X" fragments), and ones with only Eco
RI-digested ends ("E-E" fragments). Thus, the plasmid
for the Hind III-Eco RI digestion accepts H-E fragments; the
plasmid for the Hind III-Xba I digestion accepts H-X
fragments; and the plasmid for the Xba I-Eco RI digestion
accepts X-E fragments. The construction of the plasmid for
accepting H-E fragments is described below. The other
plasmids are constructed in a similar manner. Synthetic
oligonucleotides (i) through (iv) are combined with an Eco RI-
and Hind III-digested pUC19 in a ligation reaction so that
they assemble into the double stranded insert of Formula I.

(i) 5'-AATTAGCCGTACCTGCAGCAGTGCAGAAGC
TTGCGT (SEQ ID NO: 2)

(ii) 5'-AAACCTCAGAATTCCTGCACAGCTGCGAAT
CATTCG (SEQ ID NO: 3)

(iii) 5'-AGCTCGAATGATTCGCAGCTGTGCAGGAAT
TCTGAG (SEQ ID NO: 4)

(iv) 5'-GTTTACGCAAGCTTCTGCACTGCTGCAGGT
ACGGCT (SEQ ID NO: 5)

Formula I (SEQ ID NO: 6)

Bbv I Bsg I Hind III

5'-aattagccgtacctgcagcagtgcagaagcttgcgtaaacctca-

separately digested to completion with Hind III and Eco RI,
Hind III and Xba I, and Eco RI and Xba I, respectively. For
each of the three populations, the same procedure is
followed, which is described as follows for the pUC19
designed for H-E fragments.

Since each enzyme recognizes a six basepair recognition
sequence, about 100-140 fragments are produced for a total
of about 3.3 pmol of fragments, about fifty percent of which
are H-E fragments. 5.26 µg (3 pmol) of plasmid DNA is
digested with Eco RI and Hind III in Eco RI buffer as
recommended by the manufacturer (New England Biolabs,
Beverly, Mass.), purified by phenol extraction and ethanol
precipitation, and ligated to the H-E fragments of the mix-
ture in a standard ligation reaction. A bacterial host is
transformed, e.g. by electroporation, and plated so that hosts
containing recombinant plasmids are identified by white
colonies. The digestion of the yeast chromosome 1 generates
about 124 fragments of the three types, about fifty percent of
which are H-E fragments and about twenty-five percent each
are H-H or E-E fragments. About 290 colonies are picked for
H-E fragments, and about 145 each are picked for H-H and
E-E fragments. The same procedure is carried out for all the
other types of fragments so that six populations of trans-
formed hosts are obtained, one each for H-E, H-X, X-E,
H-H, E-E, and X-X fragments. Each of the populations is
treated separately as follows: About 10 µg of plasmid DNA
is digested to completion with Bsg I using the manufactur-
er's protocol (New England Biolabs, Beverly, Mass.) and
after phenol extraction the vector/segment-containing frag-
ment is isolated, e.g. by gel electrophoresis. The ends of the
isolated fragment are then blunted by Mung bean nuclease
(using the manufacturer's recommended protocol, New
England Biolabs), after which the blunted fragments are
purified by phenol extraction and ethanol precipitation. The
fragments are then resuspended in a ligation buffer at a
concentration of about 0.05 µg/ml in 20 1-ml reaction
volumes. The dilution is designed to promote self-ligation of
the fragments, following the protocol of Collins et al, Proc.
Natl. Acad. Sci., 81: 6812-6816 (1984). After ligation and
concentration by ethanol precipitation, phages from the 20
reactions are combined. The pairs of segments carried by the
plasmids are then amplified by PCR using primers p1 and p2.
The amplified product is purified by phenol extraction and
ethanol precipitation, after which it is cleaved with Bbv I
using the manufacturer's recommended protocol (New
England Biolabs). After isolation by polyacrylamide gel
electrophoresis, the pairs are concatenated by carrying out a
conventional ligation reaction. The concatenated fragments

tcggcatggacgtcgtca cgtcttc gaacgc atttggagt - are then separated by polyacrylamide gel electrophoresis 

p t primer binding site 50 and concatemers greater than about 200 basepairs are iso- 

, latcd and Heated into an cquimolar mixture of three Phag- 

p2 primer banding site . . OT , /c ,„ . ^. . „ . 

-gaattcctgcacagctgcgaatcattcg escn P l SK sequencing vectors (Stratagene Clonmg Systems, 

-cttaaggacg tgtcgacgcttagtaagc tcga La Jolla, Calif), separately digested with Hind III, Eco RI, 

t t t and Hind HI and Eco RI, respectively. (Other appropriate 

eco ri Beg i Bbv i 55 mix^es and digestions are employed when different com- 
binations of enzymes are used). Preferably, a number of 

Note that the insert has compatible ends to the Eco RI-Hind clones are e *P™ ded ^d sequenced that ensure with a 

Ill-digested plasmid, but that the original Eco RI and Hind probability of at least 99% that all of die pairs of the ; aliquot 

III sites are destroyed upon ligation. Hie horizontal arrows a ' e sequenced^ lane of sequence data (about 600 bases) 

1 j u 1 *u T» t j til t * ■ 1 » 60 obtained with conventional sequencing provides the 

above and below the Bsg I and Bbv I sites indicate the r , , oc . ? , t%. n . 

,. £ . . *. ... - A sequences of about 25 pairs of segments. Thus, after 

direction of the cleavage site relative to the recognition si e tra 4 nsfection> about 13 ^ ^ ded and 

of the enzymes. After ligation transformation of a suitable sequeQced on a commercially available DNAsequencer, e.g. 

host, and expansion, the modified pUC 1 9 is isolated and the p£ Applied Biosystems model 377, to give the identities of 

insert is sequenced to confirm its identity. 65 a5out 325 pairs of seg ments. The other sets of fragments 

Yeast chromosome 1 DNA is separated into three aliquots require an additional 26 lanes of sequencing (13 each for the 

of about 5 jug DNA (0.033 pmol) each, which are then H-X and X-E fragments). 
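The lane arithmetic at the end of Example 1 can be restated as a short calculation. The following is only an illustrative Python sketch: the function and variable names are hypothetical, the figure of 25 pairs per lane is the approximate value given above, and treating the first 13 lanes as belonging to the Hind III-Eco RI set is an assumption about how the text should be read.

```python
import math

# Approximate yield quoted in Example 1: one lane (about 600 bases)
# gives the sequences of about 25 pairs of segments.
PAIRS_PER_LANE = 25


def pairs_read(lanes):
    """Approximate number of segment pairs identified from `lanes` lanes."""
    return lanes * PAIRS_PER_LANE


def lanes_needed(pairs):
    """Smallest whole number of lanes that yields at least `pairs` pairs."""
    return math.ceil(pairs / PAIRS_PER_LANE)


# Lanes per digestion set: 13 for the set worked through in the text
# (assumed here to be the Hind III-Eco RI set) plus 13 each for the
# H-X and X-E sets.
lanes_per_set = {"H-E": 13, "H-X": 13, "X-E": 13}
total_lanes = sum(lanes_per_set.values())

print(pairs_read(13))     # 325
print(lanes_needed(325))  # 13
print(total_lanes)        # 39
```

The total of 39 lanes is the same figure that appears as the number of sequencing reactions for the map in Example 2.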
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FIG. 4 illustrates the positions on yeast chromosome 1 of 
pairs of segments ordered in accordance with the algorithm 
of Appendix A. The relative spacing of the segments along 
the chromosome is only provided to show the distribution of 
sequence information along the chromosome. 
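The ordering step can be pictured as chaining pairs end to end: the second segment of one pair must equal the first segment of the next. The Python sketch below illustrates that idea only; it is not the Appendix A program. The function name and the toy segment strings are hypothetical, and raising an error on an ambiguous match is an assumption made here for clarity (the Appendix A code prints a warning instead).

```python
def order_pairs(pairs):
    """Chain (first, second) segment pairs into map order.

    Starts from the first pair in the list and repeatedly appends the
    unused pair whose first segment equals the second segment of the
    last pair in the chain. Only extends in one direction, so the
    starting pair should lie at one end of the map.
    """
    remaining = list(pairs)
    chain = [remaining.pop(0)]
    while remaining:
        tail = chain[-1][1]
        matches = [i for i, (first, _) in enumerate(remaining) if first == tail]
        if not matches:
            break  # no pair continues the chain
        if len(matches) > 1:
            # Assumption of this sketch: treat ambiguity as an error.
            raise ValueError(f"more than one pair starts with {tail!r}")
        chain.append(remaining.pop(matches[0]))
    return chain


# Toy input with made-up 4-base segments standing in for 14-mers.
toy_pairs = [("AAAA", "CCCC"), ("GGGG", "TTTT"), ("CCCC", "GGGG")]
print(order_pairs(toy_pairs))
```

In this toy input the third pair links the first to the second, so the chain comes out as AAAA-CCCC, CCCC-GGGG, GGGG-TTTT.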

EXAMPLE 2 

Directed Sequencing of Yeast Chromosome 1
Using Restriction Map Sequences as Spaced PCR Primers






In this example, the 14-mer segments making up the
physical map of Example 1 are used to separately amplify by
PCR fragments that collectively cover yeast chromosome 1.
The PCR products are inserted into standard M13mp19, or
like, sequencing vectors and sequenced in both the forward
and reverse directions using conventional protocols. For
fragments greater than about 800 basepairs, the sequence
information obtained in the first round of sequencing is used
to synthesize new sets of primers for the next round of
sequencing. Such directed sequencing continues until each
fragment is completely sequenced. Based on the map of
Example 1, 174 primers are synthesized for 173 PCRs. The
total number of sequencing reactions required to cover yeast
chromosome 1 depends on the distribution of fragment
sizes, and particularly, how many rounds of sequencing are
required to cover each fragment: the larger the fragment, the
more rounds of sequencing that are required for full coverage.
Full coverage of a fragment is obtained when inspection
of the sequence information shows that complementary
sequences are being identified. Below, it is assumed that
conventional sequencing will produce about 400 bases at
each end of a fragment in each round. The distribution of
fragment sizes from the Example 1 map of yeast
chromosome 1 is shown below together with
reaction and primer requirements:



Round of     Fragment      Number of    Number of Seq.    Number of Sequencing
Sequencing   size range    Fragments    or PCR Primers    Reactions

1            >0            174          174               348
2            >800           92          184               184
3            >1600          53          106               106
4            >2400          28           56                56
5            >3200          16           32                32
6            >4000           7           14                14
7            >4800           5           10                10
8            >5600           1            2                 2

             Total No. of Primers:      578               752
             Seq. reactions for map:                       39
             Total No. of Reactions:                      791

This compares to about 2500-3000 sequencing reactions
that are required for full coverage using shotgun sequencing.

APPENDIX A

Computer Code for Ordering Pairs into a Physical Map

      program opsort
c
c     opsort reads ordered pairs from disk files
c     p1.dat, p2.dat, and p3.dat and sorts
c     them into a physical map.
c
      character*1 op(1000,2,14),w(14),x(14)
      character*1 fp(1000,2,14),test(14)
c
c     op holds the input pairs, fp the ordered (final) pairs.
c     each input file holds two lists of pairs, each list
c     preceded by its count.
c
      open(1,file='p1.dat',status='old')
      open(5,file='olist.dat',status='replace')
c
c
      nop=0
      read(1,100)nop1
      nop=nop + nop1
      do 101 j=1,nop
        read(1,102)(w(i),i=1,14),
     +             (x(k),k=1,14)
        do 121 kk=1,14
          op(j,1,kk)=w(kk)
          op(j,2,kk)=x(kk)
121     continue
101   continue
      read(1,100)nop2
      nop=nop + nop2
      do 1011 j=nop1+1,nop
        read(1,102)(w(i),i=1,14),
     +             (x(k),k=1,14)
        do 1211 kk=1,14
          op(j,1,kk)=w(kk)
          op(j,2,kk)=x(kk)
1211    continue
1011  continue
c
      close(1)
c
      write(5,110)nop1,nop2,nop
110   format(3(2x,i4))
c
c
      open(1,file='p2.dat',status='old')
      read(1,100)nop3
      nop=nop + nop3
      do 104 j=nop1+nop2+1,nop
        read(1,102)(w(i),i=1,14),
     +             (x(k),k=1,14)
        do 122 kk=1,14
          op(j,1,kk)=w(kk)
          op(j,2,kk)=x(kk)
122     continue
104   continue
c
      read(1,100)nop4
      nop=nop + nop4
      do 1041 j=nop1+nop2+nop3+1,nop
        read(1,102)(w(i),i=1,14),
     +             (x(k),k=1,14)
        do 1221 kk=1,14
          op(j,1,kk)=w(kk)
          op(j,2,kk)=x(kk)
1221    continue
1041  continue
c
      close(1)
c
      write(5,1108)nop1,nop2,nop3,nop4,nop
1108  format(5(2x,i4))
c
c
      open(1,file='p3.dat',status='old')
      read(1,100)nop5
      nop=nop + nop5
      do 105 j=nop1+nop2+nop3+nop4+1,nop
        read(1,102)(w(i),i=1,14),
     +             (x(k),k=1,14)
        do 123 kk=1,14
          op(j,1,kk)=w(kk)
          op(j,2,kk)=x(kk)
123     continue
105   continue
c
      read(1,100)nop6
      nop=nop + nop6
      do 1051 j=nop1+nop2+nop3+nop4+nop5+1,nop
        read(1,102)(w(i),i=1,14),
     +             (x(k),k=1,14)
        do 1231 kk=1,14
          op(j,1,kk)=w(kk)
          op(j,2,kk)=x(kk)
1231    continue
1051  continue
c
      close(1)
c
      write(5,1109)nop1,nop2,nop3,nop4,nop5,nop6,nop
1109  format(7(2x,i4))
c
c
100   format(i4)
102   format(2(2x,14a1))
111   format(/)
c
c     echo all input pairs to the output file and the screen
c
      write(5,111)
      do 120 m=1,nop
        write(5,102)(op(m,1,i),i=1,14),
     +              (op(m,2,k),k=1,14)
        write(*,102)(op(m,1,i),i=1,14),
     +              (op(m,2,k),k=1,14)
120   continue
c
c     the first pair starts the map; test holds the segment
c     that the next pair must begin with
c
      write(5,111)
      do 1100 i=1,14
        test(i)=op(1,2,i)
        fp(1,1,i)=op(1,1,i)
        fp(1,2,i)=op(1,2,i)
1100  continue
c
      nxx=nop
      ns=1
c
1000  continue
      ne=0
      do 2000 ix=2,nxx
        nt=0
        do 2100 jx=1,14
          if(test(jx).ne.op(ix,1,jx)) then
            nt=nt+1
          endif
2100    continue
        if(nt.eq.0) then
          ns=ns+1
          ne=ne+1
          if(ne.gt.1) then
            write(*,1003)
1003        format(1x,'ne is gt 1')
          endif
          do 2200 kx=1,14
            fp(ns,1,kx)=op(ix,1,kx)
            fp(ns,2,kx)=op(ix,2,kx)
            test(kx)=op(ix,2,kx)
2200      continue
c         remove the matched pair from op
          mm=0
          do 2300 mx=1,nxx
            if(mx.eq.ix) then
              goto 2300
            else
              mm=mm+1
              do 2400 ma=1,14
                op(mm,1,ma)=op(mx,1,ma)
                op(mm,2,ma)=op(mx,2,ma)
2400          continue
            endif
2300      continue
        endif
2000  continue
      nxx=nxx-1
      if(ne.ne.0) then
        goto 1000
      endif
c
c     write the ordered pairs
c
      do 1220 m=1,ns
        write(5,102)(fp(m,1,i),i=1,14),
     +              (fp(m,2,k),k=1,14)
        write(*,102)(fp(m,1,i),i=1,14),
     +              (fp(m,2,k),k=1,14)
1220  continue
      write(*,100)ns
c
      close(5)
c
      end
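The primer and reaction totals tabulated in Example 2 can be re-derived from the per-round fragment counts. The Python sketch below is illustrative only and its names are hypothetical. It assumes, as one reading of that table, that each fragment entering a round needs two sequencing reactions (one per end), that rounds after the first need two newly synthesized primers per fragment, and that round 1 uses one map-derived primer per listed fragment.

```python
import math

BASES_PER_END = 400   # bases read from each end per round (Example 2)
MAP_REACTIONS = 39    # sequencing reactions used to build the map

# Number of fragments still being sequenced in each round (Example 2 table).
fragments_per_round = {1: 174, 2: 92, 3: 53, 4: 28, 5: 16, 6: 7, 7: 5, 8: 1}


def rounds_needed(fragment_size):
    """Rounds required when each round reads BASES_PER_END from both ends."""
    return max(1, math.ceil(fragment_size / (2 * BASES_PER_END)))


def requirements(round_no, n_fragments):
    """Return (primers, reactions) for one round under the stated assumptions."""
    primers = n_fragments if round_no == 1 else 2 * n_fragments
    reactions = 2 * n_fragments
    return primers, reactions


total_primers = sum(requirements(r, n)[0] for r, n in fragments_per_round.items())
total_reactions = sum(requirements(r, n)[1] for r, n in fragments_per_round.items())
grand_total = total_reactions + MAP_REACTIONS

print(total_primers, total_reactions, grand_total)  # 578 752 791
```

Under these assumptions the sketch reproduces the tabulated totals of 578 primers, 752 sequencing reactions, and 791 reactions overall.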



SEQUENCE LISTING 



(1) GENERAL INFORMATION: 

(iii) NUMBER OF SEQUENCES: 6 



(2) INFORMATION FOR SEQ ID NO: 1: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 40 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1: 



NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 40



(2) INFORMATION FOR SEQ ID NO: 2:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 36 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2: 
AATTAGCCGT ACCTGCAGCA GTGCAGAAGC TTGCGT 36 



(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 36 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 
AAACCTCAGA ATTCCTGCAC AGCTGCGAAT CATTCG 36



(2) INFORMATION FOR SEQ ID NO: 4: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 36 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 
AGCTCGAATG ATTCGCAGCT GTGCAGGAAT TCTGAG 36 



(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 36 nucleotides 
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 
GTTTACGCAA GCTTCTGCAC TGCTGCAGGT ACGGCT 36 



(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 72 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6: 
AATTAGCCGT ACCTGCAGCA GTGCAGAAGC TTGCGTAAAC CTCAGAATTC 50 
CTGCACAGCT GCGAATCATT CG 72 
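The relationship between the listed sequences can be checked mechanically. The Python sketch below is illustrative, with hypothetical helper and variable names. It confirms that oligonucleotides (i) and (ii) joined end to end give SEQ ID NO: 6, that (iii) followed by (iv) is complementary to it apart from 4-base overhangs at each end, and that neither junction with the cut vector recreates its site. The vector-side bases come from the standard Eco RI (G^AATTC) and Hind III (A^AGCTT) cut positions.

```python
OLIGO_I = "AATTAGCCGTACCTGCAGCAGTGCAGAAGCTTGCGT"    # SEQ ID NO: 2
OLIGO_II = "AAACCTCAGAATTCCTGCACAGCTGCGAATCATTCG"   # SEQ ID NO: 3
OLIGO_III = "AGCTCGAATGATTCGCAGCTGTGCAGGAATTCTGAG"  # SEQ ID NO: 4
OLIGO_IV = "GTTTACGCAAGCTTCTGCACTGCTGCAGGTACGGCT"   # SEQ ID NO: 5
SEQ_ID_6 = ("AATTAGCCGTACCTGCAGCAGTGCAGAAGCTTGCGTAAAC"
            "CTCAGAATTCCTGCACAGCTGCGAATCATTCG")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


top = OLIGO_I + OLIGO_II        # upper strand, 5' to 3'
bottom = OLIGO_III + OLIGO_IV   # lower strand, 5' to 3'

top_matches_listing = top == SEQ_ID_6
# The first four bases of each strand are single-stranded overhangs;
# the remaining 68 bases should pair exactly.
strands_pair = revcomp(bottom[4:]) == top[4:]
overhangs = (top[:4], bottom[:4])

# Junction sequences after ligation into Eco RI/Hind III-cut vector.
eco_junction = "G" + top[:5]        # vector G joined to the insert
hind_junction = top[-1] + "AGCTT"   # insert joined to vector AGCTT
sites_destroyed = eco_junction != "GAATTC" and hind_junction != "AAGCTT"

print(top_matches_listing, strands_pair, overhangs, sites_destroyed)
```

The overhangs come out as AATT and AGCT, which are the sticky ends left by Eco RI and Hind III, while the junctions read GAATTA and GAGCTT rather than the original recognition sites.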



I claim:
1. A method of mapping a polynucleotide, the method
comprising the steps of:

(a) providing a plurality of populations of restriction
fragments, the restriction fragments of each population
having an interior and ends defined by digesting the
polynucleotide with a plurality of combinations of
restriction endonucleases, and each restriction fragment
being inserted into a vector;



(b) cleaving each vector to remove the interior of the 
restriction fragment and to leave a segment of each end 
of the restriction fragment in the vector; 

(c) circularizing each vector so that the segments of each 
end of each restriction fragment are ligated together to 
form a pair of segments; 

(d) determining the nucleotide sequences of a sample of
pairs of segments to obtain a sample of pairs of
nucleotide sequences; and

(e) ordering the pairs of nucleotide sequences by matching
the nucleotide sequences between pairs to form a
map of the polynucleotide.

2. The method of claim 1 wherein said step of determining 
said nucleotide sequences of said sample of said pairs of 
segments includes the steps of ligating said sample of pairs 
of segments from said plurality of populations to form one 
or more concatenations of pairs of segments, and sequencing 
the concatenations of pairs of segments. 

3. The method of claim 2 wherein said sample includes a 
number of said pairs of segments large enough so that with 
a probability of ninety-nine percent every possible kind of 
pair of segments is represented in said sample. 

4. The method of claim 3 wherein said step of cleaving is
carried out with one or more type IIs restriction
endonucleases.

5. A method of analyzing gene expression in a cell or 
tissue, the method comprising the steps of: 

(a) forming a population of cDNA molecules from mRNA
of a cell or tissue;

(b) determining the nucleotide sequence of a portion of
each end of each cDNA molecule of the population so
that a pair of nucleotide sequences is obtained for each
cDNA of the population; and

(c) tabulating the pairs of nucleotide sequences to form a
frequency distribution of gene expression in the cell or
tissue.

6. The method of claim 5 wherein said step of determining
said nucleotide sequence of said end of each cDNA
molecule includes the steps of enzymatically removing a
segment of nucleotides from each said end, ligating the segment
of nucleotides from each said end together to form a pair of
segments, ligating a sample of pairs of segments from said
population of cDNA molecules to form one or more
concatenations of pairs of segments, and sequencing the
concatenations of pairs of segments.



